CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery

Creating #serverless data analytics
system on GCP using BigQuery
Márton Kodok / @martonkodok
Google Developer Expert at REEA.net
March 2018 - Tirgu Mures, Romania

● Geek. Hiker. Do-er.
● Among the top 3 Romanians on Stack Overflow (120k reputation)
● Google Developer Expert on Cloud technologies
● Crafting Web/Mobile backends at REEA.net
● BigQuery/Redis and database engine expert
● Active in mentoring and IT community
Twitter: @martonkodok
StackOverflow: pentium10
Slideshare: martonkodok
GitHub: pentium10
About me

REEA.net uses GCP
Build on the same infrastructure
that powers Google

Google Cloud Platform (GCP)
Compute: Compute Engine, App Engine, Kubernetes Engine, GPU, Cloud Functions, Container-Optimized OS
Big Data: BigQuery, Cloud Dataflow, Cloud Dataproc, Cloud Datalab, Cloud Pub/Sub, Genomics, Data Studio, Cloud Dataprep
Storage & Databases: Cloud Storage, Cloud Bigtable, Cloud Datastore, Cloud SQL, Cloud Spanner, Persistent Disk, Transfer Appliance
Machine Learning: Cloud Machine Learning, Cloud Vision API, Cloud Speech API, Cloud Natural Language API, Cloud Translation API, Cloud Jobs API, Cloud Video Intelligence API, Advanced Solutions Lab
Identity & Security: Cloud IAM, Cloud Resource Manager, Cloud Security Scanner, Key Management Service, BeyondCorp, Data Loss Prevention API, Identity-Aware Proxy, Security Key Enforcement
Internet of Things: Cloud IoT Core

Google Cloud Platform (GCP)
Developer Tools: Cloud SDK, Cloud Deployment Manager, Cloud Source Repositories, Cloud Tools for Android Studio, Cloud Tools for IntelliJ, Cloud Tools for PowerShell, Cloud Tools for Visual Studio, Container Registry, Google Plug-in for Eclipse, Cloud Test Lab, Container Builder
Networking: Virtual Private Cloud, Cloud Load Balancing, Cloud CDN, Cloud Interconnect, Dedicated Interconnect, Cloud DNS, Cloud Network, Cloud External IP Addresses, Cloud Firewall Rules, Cloud Routes, Cloud Router, Cloud VPN
Management Tools: Stackdriver (Monitoring, Logging, Error Reporting, Trace, Debugger), Cloud Deployment Manager, Cloud Endpoints, Cloud Console, Cloud Shell, Cloud Mobile App, Cloud Billing API, Cloud APIs

Meet Serverless
[Image: a "serverless" data center, depicted]

Event-driven serverless compute platform
[Diagram labels]
Event Source: Cloud Services, Changes in data state, Business logic events, Integrations, Multiple Platforms
Event Router: Gateway (HTTPS), Pub/Sub, Streaming
Cloud Functions
Business Value: Application, Task, Analysis, Data Warehouse

Serverless is about maximizing elasticity, cost
savings, and agility of cloud computing.
@martonkodok

Crafting a high-performance, petabyte-scale,
serverless data analytics and reporting system
on Google Cloud Platform
Goal today

Legacy Reporting System
[Diagram: App -> Cloud Load Balancing -> NGINX (Compute Engine, 10GB PD) -> Database Service (Master/Slave; three Compute Engine instances, 10GB PD each). Scheduled Tasks and Batch Processing (Compute Engine, Multiple Instances) produce Report & Share and Business Analysis.]

Serverless Reporting System
[Diagram: the legacy stack (App, Cloud Load Balancing, NGINX, Master/Slave Database Service, Scheduled Tasks and Batch Processing on Compute Engine) shown together with BigQuery and Data Studio, which provide Report & Share and Business Analysis.]

Analytics-as-a-Service - Data Warehouse in the Cloud
Scales into petabytes on managed Google infrastructure (US or EU zone)
SQL:2011 + JavaScript UDFs (User Defined Functions)
Familiar DB structure (tables, views, structs, nested fields, JSON)
Integrates with Google Sheets + Cloud Storage + Pub/Sub connectors
Decent pricing (queries: $5/TB; storage: $20/TB, cold storage: $10/TB) *Mar 2018
Open interfaces (Web UI, BQ command-line tool, REST, ODBC)
What is BigQuery?
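
The pricing line above can be turned into a quick back-of-the-envelope estimate. This is a minimal sketch that uses only the March 2018 figures quoted on the slide. The function names are hypothetical, counting 1 TB as 2**40 bytes is an assumption, and the slide does not state the billing period for storage.

```python
# Prices quoted on the slide (March 2018); they may have changed since.
QUERY_USD_PER_TB = 5.0
STORAGE_USD_PER_TB = 20.0
COLD_STORAGE_USD_PER_TB = 10.0

# Assumption of this sketch: 1 TB is counted as 2**40 bytes.
BYTES_PER_TB = 2 ** 40


def query_cost_usd(bytes_scanned: int) -> float:
    """Cost of one query that scans `bytes_scanned` bytes."""
    return bytes_scanned / BYTES_PER_TB * QUERY_USD_PER_TB


def storage_cost_usd(active_bytes: int, cold_bytes: int = 0) -> float:
    """Storage cost for one billing period at the listed rates."""
    return (active_bytes / BYTES_PER_TB * STORAGE_USD_PER_TB
            + cold_bytes / BYTES_PER_TB * COLD_STORAGE_USD_PER_TB)


if __name__ == "__main__":
    print(query_cost_usd(250 * 2 ** 30))               # query scanning 250 GiB
    print(storage_cost_usd(3 * 2 ** 40, 7 * 2 ** 40))  # 3 TB active, 7 TB cold
```

Because queries are billed by data scanned, selecting fewer columns or a single day partition lowers the first number directly.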

Columnar storage (max 10,000 columns per table)
Large files for loading: up to 5TB (CSV or JSON)
UDFs in JavaScript or SQL
Rich SQL:2011: JSON, IP, Math, RegExp, Geocode, Window functions
Modern data types: Record, Nested, Struct, Array
Append-only tables preferred (DML syntax available)
Day-partitioned tables (SELECT * FROM t WHERE day = '2018-01-01')
BigQuery: Convenience of SQL
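
To make the "Record, Nested, Struct, Array" data types concrete, here is a small sketch of one row with a nested record and a repeated field, serialized as newline-delimited JSON (the JSON layout BigQuery load jobs accept). The field names and values are made up for illustration.

```python
import json

# Hypothetical order event with a nested record and a repeated (array) field.
rows = [
    {
        "day": "2018-01-01",
        "user": {"id": 42, "country": "RO"},  # Record / Struct
        "items": [                            # Array of records
            {"sku": "A-1", "qty": 2},
            {"sku": "B-7", "qty": 1},
        ],
    },
]


def to_ndjson(records):
    """Serialize records as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records)


payload = to_ndjson(rows)
```

Keeping the items inside the row, rather than in a separate table, is what lets a single append-only table hold the whole event.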

Architecting for The Cloud
[Diagram: On-Premises Servers, Frontend, and Platform Services emit Event Sourcing, Metrics / Logs, and Streaming data; Pipelines and an ETL Engine carry it into BigQuery.]

“Our project generates many/big files.
How can I seamlessly ingest them?”

Serverless file ingest
[Diagram: On-Premises Servers, Application, Frontend, and Platform Services (Event Sourcing, Metrics / Logs, Streaming) write files to Cloud Storage; Cloud Functions (Triggered Code) loads them into BigQuery.]
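
The flow this slide appears to show (a file lands in Cloud Storage, triggered code loads it into BigQuery) could be sketched roughly as below. This is an illustration, not the deck's implementation: the dataset name, the file-naming rule, and the injected `load` callable are all hypothetical, and the event is assumed to carry the `bucket` and `name` fields of a Cloud Storage object notification.

```python
from typing import Callable, Dict, Optional

# Hypothetical destination dataset; not from the slides.
DATASET = "analytics"


def table_for_object(name: str) -> Optional[str]:
    """Map an object name such as 'logs/user_login.json' to a table name."""
    if not name.endswith(".json"):
        return None  # ignore files this sketch does not know how to load
    base = name.rsplit("/", 1)[-1]
    return base[: -len(".json")]


def on_file_uploaded(event: Dict[str, str],
                     load: Callable[[str, str], None]) -> Optional[str]:
    """Handle an 'object finalized' event from Cloud Storage.

    `load(uri, table)` is a hypothetical callable that would start a
    BigQuery load job; it is injected so the sketch runs without GCP.
    Returns the gs:// URI that was handed to `load`, or None if skipped.
    """
    table = table_for_object(event["name"])
    if table is None:
        return None
    uri = "gs://{}/{}".format(event["bucket"], event["name"])
    load(uri, "{}.{}".format(DATASET, table))
    return uri
```

Passing the loader in as a parameter keeps the handler stateless and easy to test; in a real deployment it would wrap a BigQuery client call.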

“Data needs to be processed in
multiple services.
How can we pipe to multiple places?”

Architecting for The Cloud
[Diagram: On-Premises Servers, Frontend, and Platform Services (Event Sourcing, Metrics / Logs, Streaming) feed Cloud Storage (Batch) and a Stream path; Cloud Dataflow (Process) writes to BigQuery and Cloud SQL (Analyze), which serve Data Studio and Third-Party Tools.]

“We have our app outside of GCP.
How can we use the benefits of BigQuery?”

Data Pipeline Integration at REEA.net
[Diagram labels]
Sources: Standard Devices (HTTPS), Application Servers, Database / SQL Servers, On-Premises Servers, Frontend, Platform Services (Event Sourcing, Metrics / Logs, Streaming)
Pipelines: FluentD
Analytics Backend: BigQuery, with a Cloud Storage archive (Load, Export, Replay)
Consumers: Development Team, Data Analysts (Report & Share, Business Analysis)
Tools: Tableau, QlikView, Data Studio, Internal Dashboard

The following slides will present a sample Fluentd configuration to:
1. Transform a record
2. Copy event to multiple outputs
3. Store event data in File (for backup/log purposes)
4. Stream to BigQuery (for immediate analyses)

<filter frontend.user.*>
  @type record_transformer
</filter>
<match frontend.user.*>
  @type copy
  <store>
    @type forest
    subtype file
  </store>
  <store>
    @type bigquery
  </store>
  …
</match>
The filter plugin mutates incoming data: add, modify, or delete event data and transform attributes without a code deploy.
The copy output plugin copies events to multiple outputs: file(s), multiple databases, DB engines. Great for shipping the same event to multiple subsystems.
The BigQuery output plugin streams events on the fly to the BigQuery warehouse. No need to write an integration. Data is available immediately for querying.
Whenever needed, other output plugins can be wired in, such as the Kafka or Google Cloud Storage output plugins.
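
The idea behind the copy output (every event goes to each configured store) can be illustrated with a few lines of Python. This is only a conceptual sketch with in-memory stand-ins for the file and BigQuery stores; it does not reproduce Fluentd's buffering, retries, or error handling.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]
Store = Callable[[Event], None]


def make_copy_output(stores: List[Store]) -> Store:
    """Return an output that hands every event to each store in order."""
    def emit(event: Event) -> None:
        for store in stores:
            store(dict(event))  # give each store its own shallow copy
    return emit


# Two stand-in stores: an in-memory "file" and an in-memory "warehouse".
file_log: List[Event] = []
warehouse: List[Event] = []

emit = make_copy_output([file_log.append, warehouse.append])
emit({"uid": "u1", "action": "login"})
```

Adding another destination means appending one more callable to the list, which mirrors adding one more `<store>` section to the config.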

Steps: 1. record_transformer  2. copy  3. file  4. BigQuery
<filter frontend.user.*>
  @type record_transformer
  enable_ruby
  # drop the top-level host field after the new fields are built
  remove_keys host
  <record>
    bq {"insert_id":"${uid}","host":"${host}","created":"${time.to_i}"}
    avg ${record["total"] / record["count"]}
  </record>
</filter>
Syntax: Ruby, easy to use.
Great for:
- date transformation,
- quick normalizations,
- calculating something on the fly and storing it in a clear log/analytics DB,
- renaming without a code deploy.
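
To show what the filter above does to a single event, here is a rough Python equivalent. It rests on assumptions: `${uid}` and `${host}` are read as fields of the incoming record, `bq` is treated as a nested object, and the sample event is invented.

```python
from typing import Any, Dict


def transform(record: Dict[str, Any], time: int) -> Dict[str, Any]:
    """Mimic the filter: add `bq` and `avg`, then drop the top-level `host`.

    `time` is the event time as Unix seconds (the `time.to_i` in the config).
    """
    out = dict(record)
    out["bq"] = {
        "insert_id": str(record["uid"]),
        "host": str(record["host"]),
        "created": str(time),
    }
    # True division here; Ruby's `/` rounds down when both operands are integers.
    out["avg"] = record["total"] / record["count"]
    out.pop("host", None)  # the `remove_keys host` line
    return out


event = {"uid": "u1", "host": "web-1", "total": 9, "count": 2}
result = transform(event, 1519862400)
```

The input dictionary is left untouched, so the same event can still be passed on unchanged elsewhere if needed.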

<match frontend.user.*>
  @type copy
  <store>
    @type forest
    subtype file
    <template>
      path /tank/storage/${tag}.*.log
      time_slice_format %Y%m%d
    </template>
  </store>
</match>

<match frontend.user.*>
  @type copy
  <store>
    @type bigquery
    method insert
    auth_method json_key
    json_key /etc/td-agent/keys/key-31da042be48c.json
    time_field timestamp
    time_slice_format %Y%m%d
    table user$%{time_slice}
    ignore_unknown_values
    schema_path /etc/td-agent/schema/user_login.json
  </store>
</match>
Connector uses:
- JSON key auth file
- JSON table schema
Pro features:
- streaming to partitioned tables
- ignoring unknown values (not reflected in schema)
● On data that is difficult to process/analyze using traditional databases
● Not a replacement for traditional DBs, but it complements the system
● Major strength is handling large datasets
● Applying JavaScript UDFs on columnar storage to resolve complex tasks
(e.g. JS for natural language processing)
● On streams (forms, IoT, Kafka)
● On exploring unstructured data
Where to use BigQuery?

➢ Optimize product pages
➢ Email engagement
➢ Funnel Analysis
Achievements - goal reached by measuring everything

● Funnel Analysis
Achievements

Funnel analysis: Time on upsell pages

Example HITS chain:
● article1 -> page2 -> page3 -> page4 -> orderpage1 -> thankyoupage1
● page1 -> article2 -> page3 -> orderpage2 -> ...
On purchase, attribute credit to the first article visited
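
The attribution rule above can be expressed in a few lines. This is a minimal sketch under stated assumptions: page names starting with `article` are articles, and a hit starting with `thankyoupage` marks a completed purchase. Neither rule is spelled out in the slides, and the deck's own analysis runs as SQL in BigQuery rather than in Python.

```python
from typing import List, Optional


def first_article_credit(hits: List[str]) -> Optional[str]:
    """Return the first article hit in a chain that contains a purchase.

    Returns None when the chain has no purchase or no article.
    """
    purchased = any(h.startswith("thankyoupage") for h in hits)
    if not purchased:
        return None
    for hit in hits:
        if hit.startswith("article"):
            return hit
    return None


chain = ["article1", "page2", "page3", "page4", "orderpage1", "thankyoupage1"]
credited = first_article_credit(chain)
```

Because all raw hits are stored, a different attribution rule (for example, last article before the order page) can be applied later to the same data.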

● Funnel Analysis
● Email URL click heatmap
● Email Health Dashboard (SPAM, ISP deferral, content
A/B split tests, trends or low open rate campaigns)
● Advanced segmentation (all raw data stored)
● Behavioral analytics - engaged users etc...
Achievements Continued

● SQL language to run big data queries
● run raw ad-hoc queries (by analysts, sales, or devs)
● no more throwing away, expiring, or aggregating old data
● no provisioning/deploy
● no running out of resources
● no more focusing on large-scale execution plans
● no need to re-implement tricky concepts
(time windows / join streams)
Our benefits

● No manual sharding
● No capacity guessing
● No idle resources
● No maintenance windows
● No manual scaling
● No file mgmt
BigQuery: Serverless Data Warehouse

● No servers to provision or manage
● Abstract away the complexity
● Scales with usage (always ready for viral spikes or #BlackFriday)
● Availability and fault tolerance built in
● No orchestration in code
● Never pay for idle
● Cost savings (PS: we don't have the same security budget as GCP or AWS)
● Decoupled: APIs as contracts
● Monitored: metrics and logging are a universal right
● Think concurrent, stateless, queue- and stream-based
Serverless means

Easily Build Custom Reports and Dashboards

Thank you.
Slides available on:
slideshare.net/martonkodok
Reea.net - Integrated web solutions driven by
creativity to deliver projects.
