Every company, no matter how far from tech it is, is evolving into a software company, and by extension a data company. For a small company it’s important to have access to modern Big Data tools without running a dedicated team for them.
GDG DevFest Ukraine - Powering Interactive Data Analysis with Google BigQuery - Márton Kodok
Every scientist who needs big data analytics to save millions of lives should have that power. Powering interactive data analysis requires a massive architecture and the know-how to build a fast real-time computing system. You will learn how Google BigQuery solves this problem by enabling super-fast SQL queries against petabytes of data using the processing power of Google’s infrastructure. After this session you will be able to work with BigQuery, do streaming inserts, write User Defined Functions in Javascript, and apply these to several use cases for the everyday developer: funnel analytics, behavioral analytics, exploring unstructured data. You will be able to run arbitrary queries on open data such as historical data about GitHub commits, Stack Overflow Q&A data, or analyse Reddit comments to find the books the community talks about.
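As a taste of what such a query looks like (a minimal sketch: the public dataset name is real, but the exact column layout should be checked against the current schema):

```sql
-- Top committer names in the public GitHub commits dataset
SELECT
  author.name AS author_name,
  COUNT(*) AS commits
FROM `bigquery-public-data.github_repos.commits`
GROUP BY author_name
ORDER BY commits DESC
LIMIT 10;
```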
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery - Márton Kodok
Teaser: give developers a new way of understanding advanced analytics and of choosing the right cloud architecture.
The new buzzword is #serverless, as there are many great services that help us abstract away the complexity associated with managing servers. In this session we will see how serverless helps with large data analytics backends.
We will see how to architect for the cloud and add to an existing project the components that take us to a #serverless architecture: one that ingests our streaming data and runs advanced analytics on petabytes of data using BigQuery on Google Cloud Platform - all alongside the existing stack, without being forced to re-engineer our app.
BigQuery enables super-fast SQL/Javascript queries against petabytes of data using the processing power of Google’s infrastructure. We will cover its core features, the SQL:2011 standard, working with streaming inserts, User Defined Functions written in Javascript, referencing external JS libraries, and several use cases for the everyday backend developer: funnel analytics, email heatmaps, custom data processing, building dashboards, extracting data using JS functions, emitting rows based on business logic.
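A minimal sketch of such a JavaScript UDF (the events table and its columns are hypothetical, not from the talk):

```sql
-- Extract the domain from a URL with a JavaScript UDF;
-- external JS libraries could be attached via
-- OPTIONS (library=["gs://bucket/lib.js"])
CREATE TEMP FUNCTION urlDomain(url STRING)
RETURNS STRING
LANGUAGE js AS r"""
  var m = /https?:\/\/([^\/]+)/.exec(url);
  return m ? m[1] : null;
""";

SELECT urlDomain(page_url) AS domain, COUNT(*) AS hits
FROM `myproject.analytics.events`
GROUP BY domain
ORDER BY hits DESC;
```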
VoxxedDays Bucharest 2017 - Powering interactive data analysis with Google BigQuery - Márton Kodok
Every scientist who needs big data analytics to save millions of lives should have that power. Complex interactive Big Data analytics solutions require a massive architecture and the know-how to build a fast real-time computing system. BigQuery solves this problem by enabling super-fast, SQL-like queries against petabytes of data using the processing power of Google’s infrastructure. We will cover its core features, working with BigQuery, streaming inserts, User Defined Functions in Javascript, and several use cases for the everyday developer: funnel analytics, behavioral analytics, exploring unstructured data.
Google BigQuery for Everyday Developer - Márton Kodok
IV. IT&C Innovation Conference - October 2016 - Sovata, Romania
A. Every scientist who needs big data analytics to save millions of lives should have that power
Legacy systems don’t provide the power.
B. The simple fact is that you are brilliant but your brilliant ideas require complex analytics.
Traditional solutions are not applicable.
The Plan: have oversight over developments as they happen.
Goal: Store everything accessible by SQL immediately.
What is BigQuery?
Analytics-as-a-Service - Data Warehouse in the Cloud
Fully-Managed by Google (US or EU zone)
Scales into Petabytes
Ridiculously fast
Decent pricing (queries: $5/TB, storage: $20/TB) *October 2016 pricing
100,000 rows/sec Streaming API (see the sketch below)
Open Interfaces (Web UI, BQ command line tool, REST, ODBC)
Familiar DB structure (tables, views, records, nested fields, JSON)
Convenience of SQL + Javascript UDF (User Defined Functions)
Integrates with Google Sheets + Google Cloud Storage + Pub/Sub connectors
Client libraries available in YFL (your favorite languages)
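As a quick sketch of the Streaming API through the bq command line tool (the dataset, table, and JSON payload here are hypothetical):

```
# Stream newline-delimited JSON rows into an existing table
echo '{"user_id": "u1", "event": "signup"}' > rows.json
bq insert mydataset.events rows.json
```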
Our benefits
no provisioning/deploy
no running out of resources
no more focus on large-scale execution plans
no need to re-implement tricky concepts
(time windows / joining streams)
pay only for the columns used in our queries (see the dry-run sketch below)
run raw ad-hoc queries (by analysts, sales, or devs)
no more throwing away, expiring, or aggregating old data.
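Because queries are billed by the bytes scanned in the referenced columns, a dry run shows the cost before anything is executed (a sketch; the query itself is arbitrary):

```
# Estimate bytes processed without running the query
bq query --dry_run --use_legacy_sql=false \
  'SELECT author.name FROM `bigquery-public-data.github_repos.commits`'
```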
BigQuery ML - Machine learning at scale using SQL - Márton Kodok
With BigQuery ML, you can build machine learning models without leaving the data warehouse environment and train them on massive datasets. We are going to demonstrate how to build, train, evaluate, and predict with your own scalable machine learning models using standard SQL in Google BigQuery.
We will see how we can use the CREATE MODEL SQL syntax to build different models such as:
Linear regression
Multiclass logistic regression for classification
K-means clustering
Import TensorFlow models for prediction in BigQuery
We will see how we can apply these models on tabular data in retail and marketing use cases.
Models are trained and accessed in BigQuery using SQL — a language data analysts know. This enables business decision making through predictive analytics across the organization without leaving the query editor.
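A minimal CREATE MODEL sketch for the logistic regression case (the dataset, table, and column names are hypothetical):

```sql
CREATE OR REPLACE MODEL `mydataset.purchase_model`
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['purchased']
) AS
SELECT country, device, pageviews, purchased
FROM `mydataset.sessions`;

-- Evaluate the trained model, then predict on fresh rows
SELECT * FROM ML.EVALUATE(MODEL `mydataset.purchase_model`);
SELECT * FROM ML.PREDICT(MODEL `mydataset.purchase_model`,
  (SELECT country, device, pageviews FROM `mydataset.new_sessions`));
```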
Supercharge your data analytics with BigQuery - Márton Kodok
Powering interactive data analysis requires a massive architecture and the know-how to build a fast real-time computing system. BigQuery solves this problem by enabling super-fast, SQL-like queries against petabytes of data using the processing power of Google’s infrastructure. We will cover its core features, creating tables, columns, views, working with partitions, clustering for cost optimizations, streaming inserts, User Defined Functions, and several use cases for the everyday developer: funnel analytics, behavioral analytics, exploring unstructured data.
The other part will be about BigQuery ML, which enables users to create and execute machine learning models in BigQuery using standard SQL queries. BigQuery ML democratizes machine learning by enabling SQL practitioners to build models using existing SQL tools and skills. BigQuery ML increases development speed by eliminating the need to move data.
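Partitioning and clustering, mentioned above, are plain DDL; a sketch with hypothetical table and column names:

```sql
CREATE TABLE `mydataset.events_partitioned`
PARTITION BY DATE(event_ts)
CLUSTER BY user_id
AS SELECT * FROM `mydataset.events_raw`;

-- Filtering on the partition column scans only the matching
-- partitions, which directly lowers query cost
SELECT user_id, COUNT(*) AS events
FROM `mydataset.events_partitioned`
WHERE DATE(event_ts) BETWEEN '2019-01-01' AND '2019-01-31'
GROUP BY user_id;
```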
An introduction to our data warehouse solution called BigQuery.
The Google Cloud Platform products are based on our internal systems which power Google AdWords, Search, YouTube and our leading research in the field of real-time data analysis.
You can get access to our free trial ($300 for 60 days) through google.com/cloud
DevFest Romania 2020 Keynote: Bringing the Cloud to you - Márton Kodok
Next OnAir ’20 in review:
1. Real-time AI solutions like anomaly detection, pattern recognition, and predictive forecasting
2. Recommendations AI: a rich experience for personalized product recommendations
3. Media Translation API: real-time speech translation from streaming audio
4. Lending DocAI: a solution powered by Document AI for the mortgage industry
5. Contact Center AI: support over chat/voice calls by identifying intent and providing assistance
Confidential VMs are a breakthrough technology that allows customers to encrypt their most sensitive data in the cloud while it is being processed.
Cloud Run:
- Minimum idle instances
- Allocate 4 vCPUs and 4GiB memory
- Requests up to 60 minutes
- Server-side HTTP + gRPC streaming
- VPC access support
- External Load Balancing
Serverless orchestration and automation with Cloud Workflows (beta)
- Steps defined in YAML (see the sketch after this list)
- Built-in decision and conditional exec
- Subworkflows
- Support for external API calls
- Custom predicate for retries
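A minimal Cloud Workflows definition showing YAML steps, a condition, and an external API call (the URL and values are hypothetical):

```yaml
main:
  steps:
    - getQuote:
        call: http.get
        args:
          url: https://example.com/api/quote
        result: quote
    - decide:
        switch:
          - condition: ${quote.body.amount > 100}
            next: bigQuote
        next: smallQuote
    - bigQuote:
        return: "large"
    - smallQuote:
        return: "small"
```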
Predict, recommend and forecast with BigQuery ML
CREATE MODEL syntax in BigQuery to run Machine Learning tasks
Supported models:
- K-means clustering for data segmentation
- Recommend with Matrix Factorization
- Perform time-series forecasts (see the sketch below)
- Import TensorFlow models
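For the time-series case, a forecast sketch using the ARIMA_PLUS model type (dataset and columns are hypothetical):

```sql
CREATE OR REPLACE MODEL `mydataset.sales_forecast`
OPTIONS (
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'day',
  time_series_data_col = 'revenue'
) AS
SELECT day, revenue FROM `mydataset.daily_sales`;

-- Forecast the next 30 points
SELECT * FROM ML.FORECAST(MODEL `mydataset.sales_forecast`,
                          STRUCT(30 AS horizon));
```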
Single interface for multiple services with API Gateway
Find Your Topic and Skill Level
Qwiklabs + New Tutorials Center
[LondonSEO 2020] BigQuery & SQL for SEOs - Areej AbuAli
In this talk, Areej will share her learning process, how SEOs can get acquainted with the world of BigQuery and why SQL is the new and improved Excel. The audience will walk away with a handful of scripts and tips to get their BigQuery journey started!
Data Lineage with Apache Airflow using Marquez - Willy Lulciuc
The term data quality is used to describe the dependability, reliability, and usability of datasets. Data scientists and business analysts often determine the quality of a dataset by its trustworthiness and completeness. But what information might be needed to differentiate between useful vs noisy data? How quickly can data quality issues be identified and explored? More importantly, how can metadata enable data scientists to make better sense of the high volume of data within their organization from a variety of data sources?
With Airflow now ubiquitous for DAG orchestration, organizations increasingly depend on Airflow to manage complex inter-DAG dependencies and provide up-to-date runtime visibility into DAG execution. At WeWork, Airflow has quickly become an important component of our Data Platform powering billing, space inventory, etc. But what effects (if any) would upstream DAGs have on downstream DAGs if dataset consumption was delayed? What alerting rules should be in place to notify downstream DAGs of possible upstream processing issues or failures?
At WeWork, we feel it’s critical that DAG metadata is collected, maintained, and shared across the organization. This investment in metadata enables:
● Data lineage
● Data governance
● Data discovery
In this talk, we introduce Marquez: an open source metadata service for the collection, aggregation, and visualization of a data ecosystem’s metadata. We will demonstrate how metadata management with Marquez helps maintain inter-DAG dependencies, catalog historical runs of DAGs, and minimize data quality issues.
Applying BigQuery ML on e-commerce data analytics - Márton Kodok
With BigQuery ML, you can build machine learning models without leaving the database environment and train them on massive datasets. We are going to demonstrate common marketing Machine Learning use cases we handle at REEA.net: how to build, train, evaluate, and predict with your own scalable machine learning models using SQL in Google BigQuery, addressing the following use cases:
Customer Segmentation
Customer Lifetime Value (LTV) prediction
Conversion/Purchase prediction
The audience will get first-hand experience of how to write the CREATE MODEL SQL syntax to build machine learning models such as:
Multiclass logistic regression for classification
K-means clustering
Import TensorFlow models for prediction in BigQuery
Models are trained and accessed in BigQuery using SQL — a language data analysts know. This enables business decision making through predictive analytics across the organization without leaving the query editor.
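A segmentation sketch with k-means (the table and feature names are hypothetical):

```sql
CREATE OR REPLACE MODEL `mydataset.customer_segments`
OPTIONS (
  model_type = 'kmeans',
  num_clusters = 4
) AS
SELECT order_count, avg_basket_value, days_since_last_order
FROM `mydataset.customer_features`;

-- Assign each customer to a segment (CENTROID_ID column)
SELECT * FROM ML.PREDICT(MODEL `mydataset.customer_segments`,
                         TABLE `mydataset.customer_features`);
```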
A short introduction to BigQuery. With this presentation you'll quickly discover:
How to load data into BigQuery (see the bq load sketch below)
How to build dashboards using BigQuery
How to work with BigQuery
and, last but not least, we've added some best practices.
We hope you'll enjoy this presentation and that it will help you start exploring this wonderful solution. Don't hesitate to send us your feedback or questions.
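Loading data can be as small as one bq command (the bucket, dataset, and schema here are hypothetical):

```
# Load a CSV file from Cloud Storage into a new table
bq load --source_format=CSV --skip_leading_rows=1 \
  mydataset.orders gs://my-bucket/orders.csv \
  order_id:STRING,amount:FLOAT,created_at:TIMESTAMP
```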
Connecta Event: BigQuery and data analysis with Google Cloud Platform - ConnectaDigital
Advanced data analysis and ”big data” have climbed the trend lists in recent years and are now among the most prioritized areas in the development of new services and products for leading companies in the digital landscape.
The information that accumulates in these systems as customer interactions are digitalized has proven to be worth its weight in gold. It holds everything we need to know to make our business more efficient.
Since the summer of 2013, Connecta has had an established partnership with Google to help our customers transition to cloud services for, among other things, advanced data analysis. To get ready to help our customers, we have over a number of years built up both knowledge and hands-on experience with Google's various cloud products, such as ”Big Query”.
Big Query is a cloud-based analytics tool and part of the Google Cloud Platform. Big Query makes it possible to run fast queries against enormous datasets in just seconds. Big Query and Google Cloud Platform offer ready-made solutions for setting up and maintaining the infrastructure that makes all of this possible with simple means.
At Connecta Digital Consulting's third event of the spring, we introduced our customers and partners to the concepts of data analysis and Big Query.
The event covered the following points:
- Big Data and Business Intelligence (BI)
- “The Google Big Data tools” - success factors and how to get started
- Google Cloud Platform and how to carry out a successful cloud initiative
We presented cases and shared key lessons learned from our collaboration with Google and our customers.
Webinar: Introducing the MongoDB Connector for BI 2.0 with Tableau - MongoDB
Pairing your real-time operational data stored in a modern database like MongoDB with first-class business intelligence platforms like Tableau enables new insights to be discovered faster than ever before.
Many leading organizations already use MongoDB in conjunction with Tableau including a top American investment bank and the world’s largest airline. With the Connector for BI 2.0, it’s never been easier to streamline the connection process between these two systems.
In this webinar, we will create a live connection from Tableau Desktop to a MongoDB cluster using the Connector for BI. Once we have Tableau Desktop and MongoDB connected, we will demonstrate the visual power of Tableau to explore the agile data storage of MongoDB.
You’ll walk away knowing:
- How to configure MongoDB with Tableau using the updated connector
- Best practices for working with documents in a BI environment
- How leading companies are using big data visualization strategies to transform their businesses
Building an Enterprise-Scale Dashboarding/Analytics Platform Powered by the C... - Imply
Target is one of the largest retailers in the United States, with brick-and-mortar stores in all 50 states and one of the most-visited ecommerce sites in the country. In addition to typical merchandising functions like assortment planning, pricing and inventory management, Target also operates a large supply chain, financial/banking operations and property management organizations. As a data-driven organization, we need a data analytics platform that can address the unique needs of each of these various business units, while scaling to hundreds of thousands of users and accommodating an ever-increasing amount of data.
In this talk we’ll cover why Target chose to create our own analytics platform and specifically how Druid makes this platform successful. We’ll cover how we utilize key features in Druid, such as union datasources, arbitrary granularities, real-time ingestion, complex aggregation expressions and lightning-fast query response to provide analytics to users at all levels of the organization. We’ll also cover how Druid’s speed and flexibility allow us to provide interactive analytics to front-line, edge-of-business consumers to address hundreds of unique use-cases across several business units.
The 'macro view' on Big Query:
We started with an overview and some typical uses, then moved on to project hierarchy, access control, and security.
At the end we touched on tools and demos.
As Twitch grew, both the amount of data we received and the number of employees interested in the data grew rapidly. In order to continue empowering decision making as we scaled, we turned to Druid and Imply to provide self-service analytics to both our technical and non-technical staff, allowing them to drill into high-level metrics in lieu of reading generated reports.
In this talk, learn how Twitch implemented a common analytics platform for the needs of many different teams supporting hundreds of users, thousands of queries, and ~5 billion events each day. This session will explain our Druid architecture in detail, including:
-The end-to-end architecture deployed on Amazon that includes Kinesis, RDS, S3, Druid, Pivot and Tableau
-How the data is brought together to deliver a unified view of live customer engagement and historical trends
-Operational best practices we learnt scaling Druid
-An example walk through using the platform
MongoDB .local Paris 2020: Best practices for working with time series data - MongoDB
Time series data is increasingly at the heart of modern applications: think IoT, stock transactions, clickstreams, social media, and so on. With the shift from batch to real-time systems, efficiently capturing and analyzing time series data lets companies better detect and react to events ahead of their competitors, or improve operational efficiency to reduce cost and risk. Working with time series data often differs from working with classic application data, and there are best practices to observe. This talk covers: common components of an IoT solution; the challenges of managing time series data in IoT applications; different schema designs and their impact on memory and disk utilization, two determining factors in application performance; and how to query, analyze, and present IoT time series data using MongoDB Compass and MongoDB Charts. By the end of the session you will have a better understanding of the key best practices for managing IoT time series data with MongoDB.
Creating a Modern Data Architecture for Digital Transformation - MongoDB
By managing Data in Motion, Data at Rest, and Data in Use differently, modern Information Management Solutions are enabling a whole range of architecture and design patterns that allow enterprises to fully harness the value in data flowing through their systems. In this session we explored some of the patterns (e.g. operational data lakes, CQRS, microservices and containerisation) that enable CIOs, CDOs and senior architects to tame the data challenge, and start to use data as a cross-enterprise asset.
Complex realtime event analytics using BigQuery @Crunch Warmup - Márton Kodok
Complex event analytics solutions require a massive architecture and the know-how to build a fast real-time computing system. Google BigQuery solves this problem by enabling super-fast, SQL-like queries against append-only tables, using the processing power of Google’s infrastructure. In this presentation we will see how BigQuery solves our ultimate goal: store everything accessible by SQL immediately, at petabyte scale. We will discuss some common use cases: funnels, user retention, affiliate metrics.
Building Data Products with BigQuery for PPC and SEO (SMX 2022) - Christopher Gutknecht
In this data management session, Christopher describes how to build robust and reliable data products in BigQuery and dbt for PPC and SEO use cases. After an introduction to the modern data stack, six principles of reliable data products are presented, followed by these use cases:
- Google Ads Conversion upload
- SEO sitemap efficiency report
- Google Shopping product rating sync
- Large-Scale link checker with advertools
- Inventory-based PPC campaigns with dbt
Here is the referenced selection of gists on github: https://gist.github.com/ChrisGutknecht
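As a sketch of the dbt side of this pattern (the model and source names are hypothetical), a data product in dbt is just a SQL file that references upstream models:

```sql
-- models/ppc_campaign_inventory.sql (hypothetical dbt model)
{{ config(materialized='table') }}

SELECT
  p.product_id,
  p.stock_quantity,
  c.campaign_id
FROM {{ ref('stg_products') }} AS p
JOIN {{ ref('stg_campaigns') }} AS c
  ON c.product_group = p.product_group
WHERE p.stock_quantity > 0  -- only advertise in-stock products
```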
Modern Thinking: How Big Data and Cognitive are changing Marketing strategy
By: Ismael Yuste, Strategic Cloud Engineer, Google Cloud
Presentation: An introduction to Google's Big Data solutions
Applications need data, but the legacy approach of n-tiered application architecture doesn’t solve for today’s challenges. Developers aren’t empowered to build and iterate their code quickly without lengthy review processes from other teams. New data sources cannot be quickly adopted into application development cycles, and developers are not able to control their own requirements when it comes to data platforms.
Part of the challenge here is the existing relationship between two groups: developers and DBAs. Developers are trying to go faster, automating build/test/release cycles with CI/CD, and thrive on the autonomy provided by microservices architectures. DBAs are stewards of data protection, governance, and security. Both of these groups are critically important to running data platforms, but many organizations deal with high friction between these teams. As a result, applications get to market more slowly, and it takes longer for customers to see value.
What if we changed the orientation between developers and DBAs? What if developers consumed data products from data teams? In this session, Pivotal’s Dormain Drewitz and Solstice’s Mike Koleno will speak about:
- Product mindset and how balanced teams can reduce internal friction
- Creating data as a product to align with cloud-native application architectures, like microservices and serverless
- Getting started bringing lean principles into your data organization
- Balancing data usability with data protection, governance, and security
Presenters: Dormain Drewitz, Pivotal & Mike Koleno, Solstice
Discover BigQuery ML, build your own CREATE MODEL statement - Márton Kodok
With BigQuery ML, you can build machine learning models without leaving the database environment and train them on massive datasets. In this demo session we are going to demonstrate common marketing Machine Learning use cases: how to build, train, evaluate, and predict with your own scalable machine learning models using SQL in Google BigQuery, addressing the following use cases:
- Customer Segmentation + product cross-sale recommendation
- Conversion/Purchase prediction
- Inference with the other 20+ built-in models
The audience will get first-hand experience of how to write the CREATE MODEL SQL syntax to build machine learning models such as:
- Multiclass logistic regression for classification
- K-means clustering
- Matrix factorization
- ARIMA time series predictions
... and more.
Models are trained and accessed in BigQuery using SQL — a language data analysts know. This enables business decision-making through predictive analytics across the organization without leaving the query editor. In the end, the audience will learn how everyday developers can build, train, and run their own machine-learning models straight from the database query editor by issuing CREATE MODEL statements.
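A sketch of the cross-sale recommendation case with matrix factorization (names are hypothetical; note that this model type requires reserved/flex slots to train):

```sql
CREATE OR REPLACE MODEL `mydataset.product_recommender`
OPTIONS (
  model_type = 'matrix_factorization',
  feedback_type = 'implicit',
  user_col = 'customer_id',
  item_col = 'product_id',
  rating_col = 'purchase_count'
) AS
SELECT customer_id, product_id, purchase_count
FROM `mydataset.purchases`;

-- Top product recommendations for every customer
SELECT * FROM ML.RECOMMEND(MODEL `mydataset.product_recommender`);
```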
Webinar: Faster Big Data Analytics with MongoDB - MongoDB
Learn how to leverage MongoDB and Big Data technologies to derive rich business insight and build high performance business intelligence platforms. This presentation includes:
- Uncovering Opportunities with Big Data analytics
- Challenges of real-time data processing
- Best practices for performance optimization
- Real world case study
This presentation was given in partnership with CIGNEX Datamatics.
BigdataConference Europe - BigQuery ML - Márton Kodok
One of the hottest topics in database land these days is BigQuery ML. A new way to use machine learning on top of tabular data straight on your tables without leaving the query editor.
With BigQuery ML, you can build machine learning models without leaving the database environment and train them on massive datasets.
In this demo session, we are going to demonstrate common marketing Machine Learning use cases: how to build, train, evaluate, and predict with your own scalable machine learning models using SQL.
The audience will get first-hand experience of how to write the CREATE MODEL SQL syntax to build machine learning models such as:
– Multiclass logistic regression for classification
– K-means clustering
– Matrix factorization
– ARIMA time series predictions
– Import TensorFlow models for prediction in BigQuery
Models are trained and accessed in BigQuery using SQL — a language data analysts know. This enables business decision making through predictive analytics across the organization without leaving the query editor.
How Data Virtualization Puts Enterprise Machine Learning Programs into Production - Denodo
Watch full webinar here: https://bit.ly/3offv7G
Presented at AI Live APAC
Advanced data science techniques, like machine learning, have proven to be an extremely useful tool for deriving valuable insights from existing data. Platforms like Spark and complex libraries for R, Python, and Scala put advanced techniques at the fingertips of data scientists. However, these data scientists spend most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative to address these issues in a more efficient and agile way.
Watch this on-demand session to learn how companies can use data virtualization to:
- Create a logical architecture to make all enterprise data available for advanced analytics exercises
- Accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
- Integrate popular tools from the data science ecosystem: Spark, Python, Zeppelin, Jupyter, etc.
GDG Heraklion - Architecting for the Google Cloud Platform - Márton Kodok
Learn about cloud components and architecture overviews to build an app using GCP components.
You will get hands-on information on how to build highly scalable and flexible applications optimized to run in GCP on the same infrastructure that powers Google. We will discuss cloud concepts and highlight various design patterns and best practices.
By the end of the session you will have hands-on experience building a basic cloud application: it could be a simple web tier powered by a highly distributed database, with background tasks executed on a pub/sub system, plus information on how to go to the next level with advanced concepts like an analytics warehouse, recommendation engines, and ML.
Data Ingestion in Big Data and IoT platforms - Guido Schmutz
Many of the Big Data and IoT use cases are based on combining data from multiple data sources and making it available on a Big Data platform for analysis. The data sources are often very heterogeneous, from simple files and databases to high-volume event streams from sensors (IoT devices). It’s important to retrieve this data in a secure and reliable manner and integrate it with the Big Data platform so that it is available for analysis in real time (stream processing) as well as in batch (typical big data processing). In the past few years some new tools have emerged which are especially capable of handling the process of integrating data from outside, often called Data Ingestion. From an outside perspective, they are very similar to traditional Enterprise Service Bus infrastructures, which larger organizations often use to handle message-driven and service-oriented systems. But there are also important differences: they are typically easier to scale horizontally, offer a more distributed setup, are capable of handling high volumes of data/messages, provide very detailed monitoring at the message level, and integrate very well with the Hadoop ecosystem. This session will present and compare Apache NiFi, StreamSets and the Kafka ecosystem, and show how they handle data ingestion in a Big Data solution architecture.
As the official MongoDB-as-a-Service offering from MongoDB Inc., the maker of MongoDB, Atlas is becoming a very popular service for those who wish to build their applications in the cloud, whether on AWS, Azure, or GCP. One lesser-known cloud product offered on the Atlas platform is Stitch, a group of services designed to interact with Atlas in every conceivable way, including creating endpoints, triggers, user authentication flows, serverless functions, and a UI to handle all of this. Adding these together, you have a serverless solution running on top of MongoDB's cloud.
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration - Denodo
Watch full webinar here: https://bit.ly/3ohtRqm
Companies with corporate data lakes also need a strategy for how to best integrate them with their overall data fabric. To take full advantage of a data lake, data architects must determine what data belongs in the Lake vs. other sources, how end users are going to find and connect to the data they need as well as the best way to leverage the processing power of the data lake. This webinar will provide you with a deep dive look at how the Denodo Platform for data virtualization enables companies to maximize their investment in their corporate data lake.
Watch on-demand this webinar to learn:
- How to create a logical data fabric with Denodo
- How to leverage the data lake for MPP Acceleration and Summary Views
- How to leverage Presto with Denodo for file-based data lakes (e.g. S3, ADLS, HDFS)
Entrepreneurship Tips With HTML5 & App Engine Startup Weekend (June 2012) - Ido Green
My talk at Startup Weekend 2012 during Google I/O. It covers startup life tips, modern web apps, and how to leverage Google Cloud (specifically App Engine).
Similar to Voxxed Days Cluj - Powering interactive data analysis with Google BigQuery
Gen Apps on Google Cloud PaLM2 and Codey APIs in Action - Márton Kodok
Build applications with generative AI on Google Cloud! We are going to see in action what Gen App Builder is for developers to build and deploy AI-driven applications. We will explore Model Garden powered experiences, then we are going to learn more about the integration of these generative AI APIs. Vertex AI includes a suite of models that work with code. Together these code models are referred to as the PaLM and Codey APIs. The Vertex AI Codey APIs include the code generation API which supports generating code using a natural language description. We will show strategies for creating prompts that work with the model to generate code. At the end of the session, developers will understand how to innovate with generative AI and develop apps using the generative AI industry trends.
DevBCN Vertex AI - Pipelines for your MLOps workflows - Márton Kodok
In recent years, one of the biggest trends in applications development has been the rise of Machine Learning solutions, tools, and managed platforms. Vertex AI is a managed unified ML platform for all your AI workloads. On the MLOps side, Vertex AI Pipelines solutions let you adopt experiment pipelining beyond the classic build, train, eval, and deploy a model. It is engineered for data scientists and data engineers, and it’s a tremendous help for those teams who don’t have DevOps or sysadmin engineers, as infrastructure management overhead has been almost completely eliminated. Based on practical examples we will demonstrate how Vertex AI Pipelines scores high in terms of developer experience, how it fits custom ML needs, and how to analyze results. It’s a toolset for a fully-fledged machine learning workflow, a sequence of steps in the model development and deployment cycle, such as data preparation/validation, model training, hyperparameter tuning, model validation, and model deployment. Vertex AI comes with all classic resources plus an ML metadata store, a fully managed feature store, and a fully managed pipelines runner. Vertex AI Pipelines is a managed serverless toolkit, which means you don't have to fiddle with infrastructure or back-end resources to run workflows.
Cloud Run - the rise of serverless and containerization - Márton Kodok
Two of the biggest trends in applications development in recent years have been the rise of serverless and containerization. Cloud Run has become a de facto container runtime service, taking you to production in seconds. Based on practical examples we will demonstrate how Cloud Run scores high in terms of developer experience. It differs from a functions runtime as you can bring your own container, your own code, a folder, or binaries, and it pairs great with the container ecosystem: Cloud Build, Cloud Code, Artifact Registry, and Docker. Each Cloud Run service gets an out-of-the-box stable HTTPS endpoint, with TLS termination handled for you. Map your services to your own domains and use them for web sites, backend APIs, or workflows; invoke and connect services with the newest protocols of HTTP/2, WebSockets or gRPC (unary and streaming). Cloud Run is serverless containers, which means you don't have to fiddle with infrastructure or back-end resources to run applications.
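Deploying is a single command (the service, image, and region below are hypothetical):

```
gcloud run deploy my-service \
  --image gcr.io/my-project/my-app:latest \
  --region europe-west1 \
  --allow-unauthenticated
```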
BigQuery best practices and recommendations to reduce costs with BI Engine, S... - Márton Kodok
Best practices and recommendations for tuning BI Engine for your existing BigQuery workloads, for cheaper and faster queries. Learn how we at REEA orchestrate BI Engine reservations on a 5TB dataset - considered small for BigQuery, but with big cost savings and accelerated queries. There are many presentations aimed at big enterprises; here we showcase how our queries perform better at lower cost. We will address the top considerations for when to turn on BI Engine, and how to use cloud orchestration to make this an automatic process, combined with BigQuery and Data Studio query patterns that can save precious development time, lower bills, and speed up queries.
Vertex AI - Unified ML Platform for the entire AI workflow on Google Cloud - Márton Kodok
Vertex AI is a managed ML platform for practitioners to accelerate experiments and deploy AI models.
Enhanced developer experience
- Build with the groundbreaking ML tools that power Google
- Approachable from the non-ML developer perspective (AutoML, managed models, training)
- Ease the life of a data scientist/ML (has feature store, managed datasets, endpoints, notebooks)
- Infrastructure management overhead has been almost completely eliminated
- Unified UI for the entire ML workflow
- End-to-end integration for data and AI with build pipelines that outperform and solve complex ML tasks
- Explainable AI and TensorBoard to visualize and track ML experiments
Vertex AI: Pipelines for your MLOps workflows - Márton Kodok
In recent years, one of the biggest trends in applications development has been the rise of Machine Learning solutions, tools, and managed platforms. Vertex AI is a managed unified ML platform for all your AI workloads. On the MLOps side, Vertex AI Pipelines solutions let you adopt experiment pipelining beyond the classic build, train, eval, and deploy a model. It is engineered for data scientists and data engineers, and it’s a tremendous help for those teams who don’t have DevOps or sysadmin engineers, as infrastructure management overhead has been almost completely eliminated.
Based on practical examples we will demonstrate how Vertex AI Pipelines scores high in terms of developer experience, how it fits custom ML needs, and how to analyze results. It’s a toolset for a fully-fledged machine learning workflow, a sequence of steps in the model development and deployment cycle, such as data preparation/validation, model training, hyperparameter tuning, model validation, and model deployment. Vertex AI comes with all standard resources plus an ML metadata store, a fully managed feature store, and a fully managed pipelines runner.
Vertex AI Pipelines is a managed serverless toolkit, which means you don't have to fiddle with infrastructure or back-end resources to run workflows.
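A minimal sketch of such a pipeline, assuming the KFP v2 SDK and the google-cloud-aiplatform client (the component logic, names, project, region, and paths are illustrative):

# Sketch of a minimal Vertex AI pipeline using the KFP v2 SDK.
from kfp import dsl, compiler
from google.cloud import aiplatform

@dsl.component
def prepare_data() -> str:
    # data preparation/validation step would live here
    return "gs://my-bucket/prepared-data"

@dsl.component
def train_model(dataset_uri: str) -> str:
    # model training (or hyperparameter tuning) step would live here
    return f"model trained on {dataset_uri}"

@dsl.pipeline(name="demo-training-pipeline")
def pipeline():
    data_task = prepare_data()
    train_model(dataset_uri=data_task.output)

# Compile to a pipeline spec and hand it to the managed pipelines runner.
compiler.Compiler().compile(pipeline_func=pipeline, package_path="pipeline.json")
job = aiplatform.PipelineJob(
    display_name="demo-training-pipeline",
    template_path="pipeline.json",
    project="my-project",
    location="us-central1",
)
job.run()

Vertex AI Pipelines executes the compiled spec serverlessly and records each step's inputs and outputs in the ML metadata store.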
Cloud Workflows What's new in serverless orchestration and automationMárton Kodok
Understand how Cloud Workflows resolves challenges in connecting services, HTTP-based service orchestration, and automation. We are going to dive deep into how serverless HTTP service automation works to automate step engines. Based on practical examples we will demonstrate the newest features that let you automate the cloud and integrate with any Google Cloud product without worrying about authentication.
Serverless orchestration and automation with Cloud WorkflowsMárton Kodok
Join this session to understand how Cloud Workflows resolves challenges in connecting services, HTTP-based service orchestration, and automation. We are going to dive deep into how serverless HTTP service automation works to automate step engines. Based on practical examples we will demonstrate the built-in decision and conditional executions, subworkflows, support for external built-in API calls, and integration with any Google Cloud product without worrying about authentication. We are going to cover Marketing, Retail, Industrial, and Developer use cases, such as event-driven marketing workflow execution, inventory chain operations, generating automated state machines, orchestrating DevOps workflows, and automating the Cloud.
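Workflow definitions themselves are written in YAML; from application code you typically just trigger executions. A minimal sketch using the google-cloud-workflows Python client (project, region, workflow name, and arguments are placeholders):

# Sketch: trigger a deployed workflow and pass it JSON arguments.
import json

from google.cloud.workflows import executions_v1

client = executions_v1.ExecutionsClient()
parent = "projects/my-project/locations/europe-west4/workflows/order-pipeline"
execution = client.create_execution(
    parent=parent,
    execution=executions_v1.Execution(argument=json.dumps({"order_id": 42})),
)
print("Started:", execution.name)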
Vibe Koli 2019 - A journey from the university desks to Google Developer ExpertMárton Kodok
VIBE Koli 2019 - Vibe Garázs - Go-kart.
After finishing his studies at Sapientia, Márton Kodok built an IT career for himself and is today a member of the Google Developer Expert (GDE) team, placing him among the country's outstanding professionals. At VIBE Koli he will help you find your own path, proving that willpower is all it takes to do something different, something more than your peers.
Google Cloud Platform Solutions for DevOps EngineersMárton Kodok
Learn the DevOps essentials about cloud components and FaaS/PaaS architectural patterns that make use of Cloud Functions, Pub/Sub, Dataflow, and Kubernetes, and how we develop and deploy cloud software. You will get hands-on information on how to build, run, and monitor highly scalable and flexible applications optimized to run on GCP. We will discuss cloud concepts and highlight various design patterns and best practices.
GDG DevFest Romania - Architecting for the Google Cloud PlatformMárton Kodok
Learn about FaaS and PaaS architectural patterns that make use of Cloud Functions, Pub/Sub, Dataflow, and Kubernetes - platforms that hide the management of servers from the user and have changed how we develop and deploy software.
We discuss the difference between an event-driven approach - meaning you can trigger a function whenever something interesting happens within the cloud environment - and the simpler HTTP approach, as well as quotas, per-invocation pricing, and the advantages and disadvantages of serverless systems.
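As a sketch of that difference, using the newer Python Functions Framework purely for illustration (the function names and the bucket event are made up):

# Sketch contrasting the two trigger styles with functions-framework.
import functions_framework

@functions_framework.http
def hello_http(request):
    # Simpler HTTP approach: invoked by a plain HTTPS request.
    name = request.args.get("name", "world")
    return f"Hello {name}!"

@functions_framework.cloud_event
def on_file_uploaded(cloud_event):
    # Event-driven approach: triggered whenever something interesting
    # happens in the cloud, e.g. an object landing in a bucket.
    data = cloud_event.data
    print(f"File {data['name']} uploaded to {data['bucket']}")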
6. DISZ - Scalability of web applications on the Google Cloud PlatformMárton Kodok
The talk covers how a flexible, highly scalable service can be built on cloud providers' platforms. How can a service that at launch only has to serve a few dozen or a few hundred users elastically scale to serve thousands, or orders of magnitude more? Sit back and admire the autoscaling feature on Black Friday. We will talk about virtualization, platform-level virtualization, ultra-light application containers, and the near-real-time "shuffling" of workloads. Numerous components of the Google Cloud Platform will be presented. Banks, insurers, webshops and the like all see the cloud as their breakout point.
Voxxed Days Cluj - Powering interactive data analysis with Google BigQuery
1. Powering Interactive Data Analysis
with Google BigQuery
Márton Kodok / @martonkodok
Google Developer Expert at REEA.net
Nov 2017 - Cluj Napoca, Romania
2. ● Geek. Hiker. Do-er.
● Among the top 3 Romanians on Stack Overflow
● Google Developer Expert on Cloud technologies
● Crafting Web/Mobile backends at REEA.net
● BigQuery and database engine expert
● Active in mentoring
Twitter: @martonkodok
StackOverflow: pentium10
Slideshare: martonkodok
GitHub: pentium10
Powering Interactive Data Analysis with Google BigQuery @martonkodok
About me
3. Every company,
no matter how far from the tech they are,
is evolving into a software company,
and by extension a data company.
Turning everything into “data” drives innovation
Powering Interactive Data Analysis with Google BigQuery @martonkodok #dfua
4. For a small company it's important
to have access to modern BigData tools
without running a dedicated team for it.
Small companies should do BigData - but how?
Powering Interactive Data Analysis with Google BigQuery @martonkodok #dfua
5. ❏ Need backend/database to STORE, QUERY, EXTRACT data
❏ Deep analytics - large, multi-source, complex, unstructured
❏ Be real time
❏ Petabyte scale - Cost effective
❏ Run ad-hoc reports - without a developer - interactive
❏ Minimal engineering efforts - no dedicated BigData team
❏ Simple query language (SQL / Javascript preferred)
Making analytics accessible to more companies
Powering Interactive Data Analysis with Google BigQuery @martonkodok #dfua
6. “We can't solve problems by
using the same kind of
thinking we used when we
created them”
-Albert Einstein
Powering Interactive Data Analysis with Google BigQuery @martonkodok
The Challenge
7. Powering Interactive Data Analysis with Google BigQuery @martonkodok
Legacy Business Reporting System (architecture diagram): Web and Mobile clients → Web Server (CMS/Framework, Platform Services) → SQL Database (cached) → Scheduled Tasks → Batch Processing (Compute Engine, multiple instances) → Report & Share → Business Analysis
8. Powering Interactive Data Analysis with Google BigQuery @martonkodok
Behind the Scenes: Days To Insights (the same legacy reporting diagram)
9. Powering Interactive Data Analysis with Google BigQuery @martonkodok
Legacy Business Reporting System (diagram annotated with timings): Scheduled Tasks take minutes to kick in, Batch Processing takes hours to run, Clean and Aggregate takes hours more - DAYS TO INSIGHTS
10. ● Terabyte scalable storage
● Real-time row ingestion
● Ask sophisticated queries
● Query-performance
● Low-maintenance
● Cost effective
● Wire them up easily
Goal: Store everything accessible by SQL immediately.
Powering Interactive Data Analysis with Google BigQuery @martonkodok
Desired system/platform
Engines:
● MongoDB, Riak, Redis
● ELK Stack (Elasticsearch-Logstash-Kibana)
● Cassandra, Hive, Hadoop...
● Amazon Athena, Google BigQuery...
12. ● Analytics-as-a-Service - Data Warehouse in the Cloud
● Fully-Managed by Google (US or EU zone)
● Scales into Petabytes
● Ridiculously fast
● SQL 2011 Standard + Javascript UDF (User Defined Functions)
● Familiar DB Structure (table, views, record, nested, JSON)
● Open Interfaces (Web UI, BQ command line tool, REST, ODBC)
● Integrates with Google Sheets + Google Cloud Storage + Pub/Sub connectors
● Decent pricing (queries: $5/TB; storage: $20/TB, cold: $10/TB) *Oct 2017
Powering Interactive Data Analysis with Google BigQuery @martonkodok
What is BigQuery?
13. ● Columnar storage (max 10,000 columns per table)
● Batch load file size limits: 5TB (CSV or JSON)
● User Defined Functions in SQL or Javascript
● Rich SQL 2011: JSON, IP, Math, RegExp, Window functions
● Data types: String, Integer, Float, Boolean, Timestamp,
Record, Nested, Struct, Array.
● Append-only tables preferred (DML syntax available)
● Day partitioned tables
● ACL - row level locking (individual or group based)
Powering Interactive Data Analysis with Google BigQuery @martonkodok
BigQuery: Convenience of SQL
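To make the "convenience of SQL" concrete, here is a hedged sketch of querying a day-partitioned table with a nested, repeated RECORD column through the Python client (the project, dataset, and field names are made up for illustration):

# Sketch: standard SQL over nested/repeated fields via the Python client.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
sql = """
    SELECT user_id, event.name AS event_name, COUNT(*) AS hits
    FROM `my-project.analytics.events` AS t,
         UNNEST(t.events) AS event      -- flatten a REPEATED RECORD column
    WHERE _PARTITIONTIME >= TIMESTAMP("2017-11-01")  -- day-partitioned table
    GROUP BY user_id, event_name
    ORDER BY hits DESC
    LIMIT 10
"""
for row in client.query(sql).result():
    print(row.user_id, row.event_name, row.hits)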
14. Powering Interactive Data Analysis with Google BigQuery @martonkodok
Architecting for the Cloud (diagram): Frontend and Platform Services on On-Premises Servers emit Metrics / Logs / Streaming events → Event Sourcing → Pipelines (ETL Engine) → BigQuery
15. Data Pipeline Integration at REEA.net - Analytics Backend (diagram): Frontend, Platform Services, and Application Servers (on-premises, plus standard devices over HTTPS) emit Metrics / Logs / Streaming events into FluentD pipelines (event sourcing); FluentD feeds the Analytics Backend (BigQuery) and archives to Cloud Storage (with Load / Export / Replay paths); alongside the existing SQL Database, the Development Team and Data Analysts consume the data through Report & Share / Business Analysis tools: Tableau, QlikView, Data Studio, and an internal dashboard.
Powering Interactive Data Analysis with Google BigQuery @martonkodok #dfua
16. The following slides will present a sample Fluentd configuration to:
1. Transform a record
2. Copy event to multiple outputs
3. Store event data in File (for backup/log purposes)
4. Stream to BigQuery (for immediate analyses)
Powering Interactive Data Analysis with Google BigQuery @martonkodok #dfua
<filter frontend.user.*>
  @type record_transformer   # (1) transform the record
</filter>
<match frontend.user.*>
  @type copy                 # (2) copy the event to multiple outputs
  <store>
    @type forest             # (3) store event data in a file (backup/log)
    subtype file
  </store>
  <store>
    @type bigquery           # (4) stream the event to BigQuery
  </store>
  …
</match>
● The filter plugin mutates incoming data: add/modify/delete event data and transform attributes without a code deploy.
● The copy output plugin copies events to multiple outputs - file(s), multiple databases, DB engines. Great to ship the same event to multiple subsystems.
● The BigQuery output plugin streams the event on the fly into the BigQuery warehouse. No need to write an integration; data is available immediately for querying.
● Whenever needed, other output plugins can be wired in: Kafka, Google Cloud Storage output plugins.
Powering Interactive Data Analysis with Google BigQuery @martonkodok #dfua
18. record_transformer → copy → file → BigQuery
<filter frontend.user.*>
  @type record_transformer
  enable_ruby
  remove_keys host
  <record>
    bq {"insert_id":"${uid}","host":"${host}","created":"${time.to_i}"}
    avg ${record["total"] / record["count"]}
  </record>
</filter>
Syntax: Ruby, easy to use. Great for:
- date transformation,
- quick normalizations,
- calculating something on the fly and storing it in a clean log/analytics DB,
- renaming without a code deploy.
Powering Interactive Data Analysis with Google BigQuery @martonkodok #dfua
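For completeness, what the Fluentd BigQuery output plugin does under the hood is essentially a streaming insert; a hedged Python-client equivalent (the table and row fields are illustrative):

# Sketch: streaming insert via the Python client.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
rows = [
    {"insert_id": "evt-001", "host": "web-1", "created": 1509876543,
     "total": 120, "count": 4, "avg": 30.0},
]
# Rows become queryable within seconds - no batch load job needed.
errors = client.insert_rows_json("my-project.analytics.frontend_user", rows)
if errors:
    print("Streaming insert failed:", errors)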
21. ● On data that is difficult to process/analyze using traditional databases
● On exploring unstructured data
● Not a replacement for traditional DBs; it complements the system
● Applying Javascript UDF on columnar storage to resolve complex tasks
(eg: JS for natural language processing)
● On streams (forms, Kafka)
● On IoT streams
● Its major strength is handling large datasets
Powering Interactive Data Analysis with Google BigQuery @martonkodok
Where to use BigQuery?
22. ➢ Optimize product pages
Find, store, and analyse in BQ the time-consuming user actions, using
25x more custom events/hits than Google Analytics
➢ Email engagement
With every open/click stored as raw data, improve: subject lines, layout,
follow-up action emails, and assistant-like experiences through heavy
A/B split tests on email marketing campaigns (an interactive feedback loop)
➢ Funnel Analysis
Wrangle all the data to discover: small improvements, an AI-driven,
personalized upsell experience, pre-selling products configured on the go -
not yet in the catalog, but easily tweaked/customized
Achievements - goal reached by measuring everything
Powering Interactive Data Analysis with Google BigQuery @martonkodok #dfua
23. Go to the BigQuery web UI.
https://bigquery.cloud.google.com/
Powering Interactive Data Analysis with Google BigQuery @martonkodok
Query a public dataset
24. Powering Interactive Data Analysis with Google BigQuery @martonkodok
Romanian stations that record the most days of snow
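A sketch of the kind of query behind this slide, assuming the documented schema of the bigquery-public-data.noaa_gsod public dataset (the join keys, country code, and snow flag are taken from that schema; verify before running):

# Sketch: Romanian stations with the most recorded snow days in 2016.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
sql = """
    SELECT s.name AS station, COUNT(*) AS snow_days
    FROM `bigquery-public-data.noaa_gsod.gsod2016` AS g
    JOIN `bigquery-public-data.noaa_gsod.stations` AS s
      ON g.stn = s.usaf AND g.wban = s.wban
    WHERE s.country = 'RO'            -- Romanian stations
      AND g.snow_ice_pellets = '1'    -- days on which snow was recorded
    GROUP BY station
    ORDER BY snow_days DESC
    LIMIT 10
"""
for row in client.query(sql).result():
    print(row.station, row.snow_days)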
25. Powering Interactive Data Analysis with Google BigQuery @martonkodok
Mentions of RO politicians since Nov '16 in GDELT articles
31. ● Funnel Analysis
● Email URL click heatmap
● Email Health Dashboard (SPAM, ISP deferral, content
A/B split tests, trends or low open rate campaigns)
● Advanced segmentation (all raw data stored)
● Behavioral analytics - engaged users etc...
Powering Interactive Data Analysis with Google BigQuery @martonkodok
Achievements Continued
32. ● no provisioning/deploy
● no running out of resources
● no more focus on large scale execution plan
● no need to re-implement tricky concepts
(time windows / join streams)
● run raw ad-hoc queries (either by analysts/sales or Devs)
● no more throwing away, expiring, or aggregating old data.
Powering Interactive Data Analysis with Google BigQuery @martonkodok
Our benefits
33. ● No manual sharding
● No capacity guessing
● No idle resources
● No maintenance windows
● No manual scaling
● No file mgmt
BigQuery: Serverless Data Warehouse
Powering Interactive Data Analysis with Google BigQuery @martonkodok #dfua
(diagram: the serverless data warehouse depicted)
34. Powering Interactive Data Analysis with Google BigQuery @martonkodok
Easily Build Custom Reports and Dashboards
35. Thank you.
Slides available on: slideshare.net/martonkodok
Reea.net - Integrated web solutions driven by creativity to deliver projects.