Complex Realtime Event Analytics using BigQuery
Márton Kodok
Senior Software Engineer at REEA
twitter: martonkodok stackoverflow: pentium10 github: pentium10
Crunch Warm Up - October 2015 - Budapest
Agenda
1. Big Data movement
2. Analytics Project - Background
3. Challenges - Why is it so hard?
4. Approach - Strategy - Application
5. Use Cases - Implementations
6. Exploring Big Data (GDELT, Hacker News, Reddit)
Complex Realtime Event Analytics using BigQuery @martonkodok
Big data analysis movement
Every scientist who needs
big data analytics to save millions of lives
should have that power.
Challenging experience
The simple fact is that
you are brilliant
but your brilliant ideas require
complex big data analytics.
Project: One-size-fits-all problem
Need a backend to store, query, extract for deep analytics:
● Events (product, app, site, email events)
● Achievements (“tag” users on the go, retention)
● Entities (split tests, user profiles, business entities)
● Metrics (app profiler data, custom)
● Email activity (click-map, engagement, ISP, Spam)
● 3rd party Analytics (good to have: Google Analytics)
● System-generated data (log file entries, unstructured)
Desired system/platform
● Terabyte scalable storage
● Real-time event ingestion
● Ask sophisticated queries (optional: without Dev)
● Query-performance
● Low-maintenance
● Cost effective
● Wire them up easily
Goal: Store everything and make it immediately accessible via SQL.
Equipment strategy
● In-House
● Hosted
● Managed
* people still required
Services:
❏ ELK Stack (Elasticsearch-Logstash-Kibana)...
❏ Cassandra, Hive, Hadoop...
❏ Amazon Redshift, Google BigQuery...
Google BigQuery
What is BigQuery?
● Analytics-as-a-Service - Data Warehouse in the Cloud
● Fully-Managed
● Scales into Petabytes
● Ridiculously fast
● Decent pricing (queries $5/TB, storage: $20/TB)
● 100,000 rows/sec Streaming API
* October 2015 pricing
BigQuery: Big Data Analytics in the Cloud
● Convenience of SQL
● Familiar DB Structure (table, column, views, JSON)
● Open Interfaces (REST, Web UI, ODBC)
● Fast atomic imports JSON/CSV (file size up to 5TB)
● Simple data ingest from GCS or Hadoop
● Web UI + bq CLI
● Connectors: Hadoop, Tableau, R, Talend, Logstash
● US or EU zone
BigQuery: Convenience of SQL/JSON/JS
● Append-only tables
● Batch load file size limits: 5TB (CSV or JSON)
● ACL - row level locking (individual or group based)
● Columnar storage (max 10,000 columns per table)
● Rich SQL: JSON, IP, Math, RegExp, Window functions
● Datatypes: String 2MB, Record, Nested …
● UDF (user-defined functions): JavaScript
Note: Store what you can in columns, the rest in JSON.
BigQuery Costs - October 2015
* 1 Petabyte storage, 100 TB rows insert, 100 TB queries => 26,000 USD
Queries
➔ 1 TB per month free
➔ 5 USD per TB
➔ only pay for the columns you use in your query
Storage
➔ 20 USD per TB
Ingestion
➔ Batch load free (CSV/JSON)
➔ Exporting free
➔ Table copy free
➔ 1 USD per 20TB data
Estimate 1
- Storage 5 TB
- Streaming Inserts 5TB
- Queries 3 TB
Monthly total: 110 USD
Estimate 2
- Storage 20 TB
- Streaming Inserts 10TB
- Queries 10 TB
Monthly total: 455 USD
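The arithmetic behind such estimates can be sketched as follows. This is a minimal illustration, not an official calculator: the rates are copied from this slide as printed (October 2015), the function name `monthly_cost` is made up for the example, and the "1 USD per 20TB data" figure is assumed to be the streaming-insert rate.

```python
# Rates as printed on the slide (October 2015); treat them as illustrative inputs.
QUERY_USD_PER_TB = 5.0
QUERY_FREE_TB = 1.0                 # first 1 TB of queries per month is free
STORAGE_USD_PER_TB = 20.0
STREAMING_USD_PER_TB = 1.0 / 20.0   # assumption: "1 USD per 20TB data" applies to streaming


def monthly_cost(storage_tb, streaming_tb, query_tb):
    """Return the estimated monthly bill in USD for the given volumes."""
    billable_query_tb = max(0.0, query_tb - QUERY_FREE_TB)
    return (storage_tb * STORAGE_USD_PER_TB
            + streaming_tb * STREAMING_USD_PER_TB
            + billable_query_tb * QUERY_USD_PER_TB)


if __name__ == "__main__":
    print(monthly_cost(storage_tb=5, streaming_tb=5, query_tb=3))
    print(monthly_cost(storage_tb=20, streaming_tb=10, query_tb=10))
```

With these inputs Estimate 1 comes to roughly 110 USD, in line with the slide. Estimate 2 comes to roughly 445 USD, so the slide's 455 USD presumably includes a streaming rate or rounding that the printed rates do not capture.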
UDF - Power of JavaScript
● for logic that is hard or impossible to express in SQL: loops, complex conditionals, string parsing or transformations
● UDFs are similar to map functions in MapReduce
● inline JS or from GCS (gs://some-bucket/js/lib.js)
Some UDF use cases:
● take one row and emit zero or more rows
● decoding URL-encoded strings
● text readability
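BigQuery UDFs themselves are written in JavaScript; the sketch below uses Python only to illustrate the map-like shape described above, where one input row yields zero or more output rows. The function name `explode_query_params` and the fields `user_id`, `query`, `param` and `value` are hypothetical.

```python
from urllib.parse import parse_qsl


def explode_query_params(row):
    """Take one input row and yield zero or more output rows.

    `row` is a dict with hypothetical fields `user_id` and `query`
    (a URL-encoded query string). One row is emitted per parameter.
    """
    for key, value in parse_qsl(row.get("query", "")):
        # parse_qsl already decodes percent-escapes and '+' signs.
        yield {"user_id": row["user_id"], "param": key, "value": value}


if __name__ == "__main__":
    rows = [
        {"user_id": 1, "query": "utm_source=news%20letter&utm_medium=email"},
        {"user_id": 2, "query": ""},  # emits no rows
    ]
    for r in rows:
        for out in explode_query_params(r):
            print(out)
```

The generator form mirrors the "emit" style of a map function: a row with no parameters simply produces nothing.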
Append only tables - Get last value
1. Use aggregation MIN/MAX on timestamp to find first/last and join back to the same table.
2. Use analytic functions FIRST_VALUE and LAST_VALUE.
   SELECT LAST_VALUE(email) OVER (
     PARTITION BY user_id
     ORDER BY timestamp ASC) AS email_last ...
3. Use the ROW_NUMBER window function and keep only the newest row per user.
   SELECT email, firstname, lastname
   FROM
     (SELECT email, firstname, lastname,
        ROW_NUMBER() OVER (PARTITION BY user_id
                           ORDER BY timestamp DESC) seqnum
      FROM [profile_event]
     )
   WHERE seqnum = 1
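The idea shared by these approaches, keeping only the newest record per user from an append-only log, can be sketched in plain Python. This is an illustration of the logic rather than of how BigQuery executes it; the function name `latest_per_user` and the sample events are made up.

```python
def latest_per_user(events):
    """Return {user_id: event}, keeping the event with the greatest timestamp."""
    latest = {}
    for event in events:
        current = latest.get(event["user_id"])
        # Replace the stored event only when a strictly newer one arrives.
        if current is None or event["timestamp"] > current["timestamp"]:
            latest[event["user_id"]] = event
    return latest


if __name__ == "__main__":
    # Hypothetical append-only profile events; later rows supersede earlier ones.
    profile_events = [
        {"user_id": 1, "timestamp": 10, "email": "old@example.com"},
        {"user_id": 2, "timestamp": 11, "email": "b@example.com"},
        {"user_id": 1, "timestamp": 12, "email": "new@example.com"},
    ]
    for user_id, event in sorted(latest_per_user(profile_events).items()):
        print(user_id, event["email"])
```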
Table wildcard functions
This example assumes the following tables exist:
● mydata.people20140323
● mydata.people20140324
● mydata.people20140325
SELECT
name
FROM
(TABLE_DATE_RANGE(mydata.people,
DATE_ADD(CURRENT_TIMESTAMP(), -2, 'DAY'),
CURRENT_TIMESTAMP()))
WHERE
age >= 35
#... another example with RegExp ...
FROM
(TABLE_QUERY(mydata,
'REGEXP_MATCH(table_id, r"^boo[\d]{3,5}")'))
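To make the daily-table naming convention concrete, the following sketch expands a prefix and a date range into the table names a date-range wildcard would cover. It only illustrates the naming scheme; the expansion happens inside BigQuery, and the helper name `tables_in_date_range` is hypothetical.

```python
from datetime import date, timedelta


def tables_in_date_range(prefix, start, end):
    """List daily table names prefix+YYYYMMDD for start..end, both inclusive."""
    days = (end - start).days
    return [prefix + (start + timedelta(days=i)).strftime("%Y%m%d")
            for i in range(days + 1)]


if __name__ == "__main__":
    for name in tables_in_date_range("mydata.people",
                                     date(2014, 3, 23), date(2014, 3, 25)):
        print(name)
```

Sharding tables by day this way keeps each query limited to the days it actually needs, which matters when billing is per byte scanned.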
Infrastructure
Schema modelling
+--------------------------+-----------+----------+--+
| order_id | INTEGER | REQUIRED | |
| ... | | | |
| products | RECORD | REPEATED | |
| products.product_id | INTEGER | NULLABLE | |
| products.attributes | STRING | REPEATED | |
| products.price | FLOAT | NULLABLE | |
| products.name | STRING | NULLABLE | |
| ... | | | |
| common | RECORD | NULLABLE | |
| common.insert_id | INTEGER | REQUIRED | |
| common.tenant | INTEGER | REQUIRED | |
| common.event | INTEGER | REQUIRED | |
| common.user_id | INTEGER | REQUIRED | |
| common.timestamp | TIMESTAMP | REQUIRED | |
| .... | | | |
| common.utm | RECORD | NULLABLE | |
| common.utm.source | STRING | NULLABLE | |
| common.utm.medium | STRING | NULLABLE | |
| common.utm.campaign | STRING | NULLABLE | |
| common.utm.content | STRING | NULLABLE | |
| common.utm.term | STRING | NULLABLE | |
| meta | STRING | NULLABLE | |
+--------------------------+-----------+----------+--+
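A row shaped for this schema might be assembled as in the sketch below, following the earlier note to store what you can in columns and the rest in JSON. The helper `build_order_row`, its parameters and the sample values are hypothetical, and only the REQUIRED `common` fields visible on the slide are checked (the elided rows may contain more).

```python
import json


def build_order_row(order_id, products, common, extra=None):
    """Shape an event as a row for the schema above.

    Known attributes go into typed columns; anything else is serialized
    into the `meta` STRING column as JSON.
    """
    # REQUIRED fields of the `common` record that are visible on the slide.
    required_common = ("insert_id", "tenant", "event", "user_id", "timestamp")
    missing = [key for key in required_common if key not in common]
    if missing:
        raise ValueError("missing REQUIRED common fields: " + ", ".join(missing))
    return {
        "order_id": order_id,
        "products": list(products),   # REPEATED RECORD
        "common": dict(common),       # nested RECORD
        "meta": json.dumps(extra, sort_keys=True) if extra else None,
    }


if __name__ == "__main__":
    row = build_order_row(
        order_id=1001,
        products=[{"product_id": 7, "attributes": ["red"], "price": 9.5, "name": "Mug"}],
        common={"insert_id": 1, "tenant": 3, "event": 20, "user_id": 42,
                "timestamp": "2015-10-01 12:00:00",
                "utm": {"source": "newsletter", "medium": "email"}},
        extra={"coupon": "AUTUMN"},
    )
    print(json.dumps(row, indent=2))
```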
Streaming insert time (ms) - last 6M
Achievements
● Funnel Analysis
Attribute orders to first article visited
Example:
● article1 -> page2 -> page3 -> page4 -> orderpage1 -> thankyoupage1
● page1 -> article2 -> page3 -> orderpage2 -> ...
Problem: When an order is made, attribute a credit to the first article visited by that user!
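The attribution rule stated in the problem can be sketched in Python as follows. This illustrates the logic only, not the query used in the project: the function name `first_article_credits` is made up, and pages are classified by the `article` and `orderpage` name prefixes seen in the example paths, which is an assumption.

```python
def first_article_credits(pageviews):
    """Count orders credited to the first article each user visited.

    `pageviews` is an iterable of (user_id, page) tuples in time order.
    Pages starting with "article" are treated as articles and pages starting
    with "orderpage" as orders (a naming assumption taken from the example).
    """
    first_article = {}
    credits = {}
    for user_id, page in pageviews:
        if page.startswith("article"):
            # Remember only the first article seen for this user.
            first_article.setdefault(user_id, page)
        elif page.startswith("orderpage") and user_id in first_article:
            article = first_article[user_id]
            credits[article] = credits.get(article, 0) + 1
    return credits


if __name__ == "__main__":
    views = [
        ("u1", "article1"), ("u1", "page2"), ("u1", "page3"),
        ("u1", "page4"), ("u1", "orderpage1"), ("u1", "thankyoupage1"),
        ("u2", "page1"), ("u2", "article2"), ("u2", "page3"), ("u2", "orderpage2"),
    ]
    print(first_article_credits(views))
```

Orders placed by users who never visited an article are left unattributed in this sketch.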
Achievements
● Funnel Analysis
● Email URL click heatmap
Email URL clicks map (79 GB in 2.4 sec)
Achievements Continued
● Funnel Analysis
● Email URL click heatmap
● Email Dashboard (Trends, SPAM, ISP deferral)
● Split tests (by content, region, device, during the day)
● Ability for advanced segmentation as all raw data is stored
● Behavioral analytics (engaged users, recommendations)
Our benefits
● no provisioning/deploy
● no running out of resources
● no more focusing on large-scale execution plans
● no need to re-implement tricky concepts
(time windows / join streams)
● pay only for the columns used in our queries
● run raw ad-hoc queries (by analysts, sales or devs)
● no more throwing away, expiring or aggregating old data
BigQuery: Sample projects to try out
1. githubarchive.org: 20+ event types available since 2012
a. pull request latency
b. expressions, emotions in commit messages
2. httparchive.org: Trends in web technology
a. popular scripts
b. website performance
3. raw Google Analytics data (*only Premium Customers)
4. GDELT - Global Database of Events, Language, and Tone
GKG - Global Knowledge Graph
5. GSOD - samples of weather (rainfall, temp…)
6. 1.6 billion Reddit comments
7. Hacker News data
8. Wikipedia edits
HttpArchive - .HU JavaScript frameworks
GDELT - News Coverage: Orbán Viktor
GDELT - News Coverage: Beata Szydlo
Reddit - what the books community talks about
Questions?
Thank you.
