Complex realtime event analytics using BigQuery @Crunch Warmup

Complex event analytics solutions require a massive architecture and the know-how to build a fast real-time computing system. Google BigQuery solves this problem by enabling super-fast, SQL-like queries against append-only tables, using the processing power of Google’s infrastructure. In this presentation we will see how BigQuery meets our ultimate goal: store everything, accessible by SQL immediately, at petabyte scale. We will discuss some common use cases: funnels, user retention, and affiliate metrics.

Published in: Software


  1. 1. Complex Realtime Event Analytics using BigQuery Márton Kodok Senior Software Engineer at REEA twitter: martonkodok stackoverflow: pentium10 github: pentium10 Crunch Warm Up - October 2015 - Budapest
  2. 2. Agenda 1. Big Data movement 2. Analytics Project - Background 3. Challenges - Why is it so hard? 4. Approach - Strategy - Application 5. Use Cases - Implementations 6. Exploring Big Data (GDELT, Hackernews, Reddit) Complex Realtime Event Analytics using BigQuery @martonkodok
  3. 3. Big data analyses movement Every scientist who needs big data analytics to save millions of lives should have that power.
  4. 4. Challenging experience The simple fact is that you are brilliant but your brilliant ideas require complex big data analytics.
  5. 5. Project: One-size-fits-all problem Need a backend to store, query, extract for deep analytics: ● Events (product, app, site, email events) ● Achievements (“tag” users on the go, retention) ● Entities (split tests, user profiles, business entities) ● Metrics (app profiler data, custom) ● Email activity (click-map, engagement, ISP, spam) ● 3rd party analytics (good to have: Google Analytics) ● System-generated data (log file entries, unstructured)
  6. 6. Desired system/platform ● Terabyte scalable storage ● Real-time event ingestion ● Ask sophisticated queries (optional: without Dev) ● Query-performance ● Low-maintenance ● Cost effective ● Wire them up easily Goal: Store everything accessible by SQL immediately.
  7. 7. Equipment strategy ● In-House ● Hosted ● Managed * people still required Services: ❏ ELK Stack (Elasticsearch-Logstash-Kibana)... ❏ Cassandra, Hive, Hadoop... ❏ Amazon Redshift, Google BigQuery...
  8. 8. Google BigQuery
  9. 9. What is BigQuery? ● Analytics-as-a-Service - Data Warehouse in the Cloud ● Fully-Managed ● Scales into Petabytes ● Ridiculously fast ● Decent pricing (queries: $5/TB, storage: $20/TB) ● Streaming API: 100,000 rows/sec * October 2015 pricing
  10. 10. BigQuery: Big Data Analytics in the Cloud ● Convenience of SQL ● Familiar DB Structure (table, column, views, JSON) ● Open Interfaces (REST, Web UI, ODBC) ● Fast atomic imports JSON/CSV (file size up to 5TB) ● Simple data ingest from GCS or Hadoop ● Web UI + bq CLI ● Connectors: Hadoop, Tableau, R, Talend, Logstash ● US or EU zone
  11. 11. BigQuery: Convenience of SQL/JSON/JS ● Append-only tables ● Batch load file size limit: 5TB (CSV or JSON) ● ACL - row level locking (individual or group based) ● Columnar storage (max 10,000 columns per table) ● Rich SQL: JSON, IP, Math, RegExp, Window functions ● Datatypes: String 2MB, Record, Nested … ● UDF (user-defined functions): JavaScript Note: Store what you can in columns, the rest in JSON.
  12. 12. BigQuery Costs - October 2015
      * 1 Petabyte storage, 100 TB rows insert, 100 TB queries => 26,000 USD
      Queries: ➔ 1 TB per month free ➔ 5 USD per TB ➔ only pay for the columns you use in your query
      Storage: ➔ 20 USD per TB
      Ingestion: ➔ Batch load free (CSV/JSON) ➔ Exporting free ➔ Table copy free ➔ 1 USD per 20TB data
      Estimate 1 - Storage 5 TB - Streaming Inserts 5 TB - Queries 3 TB - Monthly total: 110 USD
      Estimate 2 - Storage 20 TB - Streaming Inserts 10 TB - Queries 10 TB - Monthly total: 455 USD
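The storage and query part of these estimates can be sketched in a few lines of Python. This is only an illustration using the rates listed on the slide (5 USD per TB queried after the first free TB, 20 USD per TB stored); it assumes both rates apply per month, the function name `monthly_cost` is made up for this sketch, and streaming-insert charges are deliberately left out, so totals that include ingestion will be higher.

```python
# Rates as listed on the October 2015 slide (USD); assumed to be monthly.
QUERY_USD_PER_TB = 5
STORAGE_USD_PER_TB = 20
FREE_QUERY_TB = 1  # first TB of queries per month is free


def monthly_cost(storage_tb, query_tb):
    """Storage plus query cost in USD; streaming inserts are NOT included."""
    storage = storage_tb * STORAGE_USD_PER_TB
    billable_query_tb = max(query_tb - FREE_QUERY_TB, 0)
    queries = billable_query_tb * QUERY_USD_PER_TB
    return storage + queries


# Inputs of Estimate 1: 5 TB stored, 3 TB queried.
print(monthly_cost(storage_tb=5, query_tb=3))  # prints 110
```

Because only the queried columns are billed, the `query_tb` figure is the volume of the columns a query touches, not the full table size.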
  13. 13. UDF - Power of JavaScript ● for what is impossible to express in SQL: loops, complex conditionals, string parsing or transformations ● UDFs are similar to map functions in MapReduce ● inline JS or from GCS (gs://some-bucket/js/lib.js) Some UDF use cases: ● take one row and emit zero or more rows ● decoding URL-encoded strings ● text readability
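BigQuery UDFs are written in JavaScript; the Python generator below is not a UDF and only mirrors the shape of the first two use cases, taking one row and emitting zero or more rows while decoding URL-encoded strings. The row layout (`user_id`, `url`) and the function name `explode_query_params` are assumptions made for this sketch.

```python
from urllib.parse import parse_qsl, urlsplit


def explode_query_params(row):
    """Map-style function: one input row in, zero or more output rows out.

    Emits one row per query-string parameter of row["url"], with the
    URL-encoded key and value decoded by parse_qsl.
    """
    query = urlsplit(row["url"]).query
    for key, value in parse_qsl(query):
        yield {"user_id": row["user_id"], "key": key, "value": value}


row = {"user_id": 7, "url": "http://example.com/search?q=big%20query&lang=en"}
for out in explode_query_params(row):
    print(out)
```

A row whose URL has no query string produces no output rows at all, which is the "zero or more" part of the contract.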
  14. 14. Append only tables - Get last value 1. Use aggregation MIN/MAX on timestamp to find first/last and join back to the same table. 2. Use analytic functions FIRST_VALUE and LAST_VALUE. SELECT LAST_VALUE(email) OVER( PARTITION BY user_id ORDER BY timestamp ASC) AS email_last ... 3. Use window functions. SELECT email, firstname, lastname FROM (SELECT email, firstname, lastname, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY timestamp DESC) seqnum FROM [profile_event] ) WHERE seqnum=1
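Option 1 (aggregate on the timestamp, then join back) can be tried locally. The sketch below uses Python's built-in `sqlite3` as a stand-in for BigQuery, so the SQL dialect differs from the slide; the table layout (`user_id`, `email`, `ts`) and the sample rows are invented for illustration.

```python
import sqlite3


def latest_profiles(rows):
    """Return the most recent (user_id, email) per user from append-only rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE profile_event (user_id INTEGER, email TEXT, ts INTEGER)"
    )
    conn.executemany("INSERT INTO profile_event VALUES (?, ?, ?)", rows)
    # Find each user's newest timestamp, then join back to fetch that row.
    query = """
        SELECT p.user_id, p.email
        FROM profile_event AS p
        JOIN (SELECT user_id, MAX(ts) AS last_ts
              FROM profile_event
              GROUP BY user_id) AS latest
          ON p.user_id = latest.user_id AND p.ts = latest.last_ts
        ORDER BY p.user_id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


rows = [
    (1, "old@example.com", 100),
    (1, "new@example.com", 200),  # later row supersedes the first one
    (2, "solo@example.com", 150),
]
print(latest_profiles(rows))
```

If two rows for the same user share the newest timestamp, the join returns both, which is one reason the `ROW_NUMBER()` variant in option 3 can be preferable.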
  15. 15. Table wildcard functions This example assumes the following tables exist: ● mydata.people20140323 ● mydata.people20140324 ● mydata.people20140325 SELECT name FROM (TABLE_DATE_RANGE(mydata.people, DATE_ADD(CURRENT_TIMESTAMP(), -2, 'DAY'), CURRENT_TIMESTAMP())) WHERE age >= 35 #... another example with RegExp ... FROM (TABLE_QUERY(mydata, 'REGEXP_MATCH(table_id, r"^boo[\d]{3,5}")'))
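The daily-table naming convention that TABLE_DATE_RANGE relies on (a common prefix followed by YYYYMMDD) can be illustrated with a small Python helper. This only generates the names such a range would cover; it does not call BigQuery, and the helper name `daily_table_names` is made up for this sketch.

```python
from datetime import date, timedelta


def daily_table_names(prefix, start, end):
    """List the prefix + YYYYMMDD table names for every day in [start, end]."""
    days = (end - start).days
    return [f"{prefix}{(start + timedelta(days=i)):%Y%m%d}" for i in range(days + 1)]


for name in daily_table_names("mydata.people", date(2014, 3, 23), date(2014, 3, 25)):
    print(name)
```

The same helper is handy on the ingestion side, for example to pick the table a batch of events for a given day should be appended to.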
  16. 16. Infrastructure
  17. 17. Schema modelling
      +--------------------------+-----------+----------+--+
      | order_id                 | INTEGER   | REQUIRED |  |
      | ...                      |           |          |  |
      | products                 | RECORD    | REPEATED |  |
      | products.product_id      | INTEGER   | NULLABLE |  |
      | products.attributes      | STRING    | REPEATED |  |
      | products.price           | FLOAT     | NULLABLE |  |
      | products.name            | STRING    | NULLABLE |  |
      | ...                      |           |          |  |
      | common                   | RECORD    | NULLABLE |  |
      | common.insert_id         | INTEGER   | REQUIRED |  |
      | common.tenant            | INTEGER   | REQUIRED |  |
      | common.event             | INTEGER   | REQUIRED |  |
      | common.user_id           | INTEGER   | REQUIRED |  |
      | common.timestamp         | TIMESTAMP | REQUIRED |  |
      | ....                     |           |          |  |
      | common.utm               | RECORD    | NULLABLE |  |
      | common.utm.source        | STRING    | NULLABLE |  |
      | common.utm.medium        | STRING    | NULLABLE |  |
      | common.utm.campaign      | STRING    | NULLABLE |  |
      | common.utm.content       | STRING    | NULLABLE |  |
      | common.utm.term          | STRING    | NULLABLE |  |
      | meta                     | STRING    | NULLABLE |  |
      +--------------------------+-----------+----------+--+
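To make the nesting concrete, here is a hypothetical event shaped like this schema, built in Python and serialized as one JSON line. All values are invented, the elided (`...`) fields are left out, and `missing_common_fields` is a helper made up for this sketch that checks the REQUIRED sub-fields of `common` before a row is sent.

```python
import json

# Hypothetical row: a repeated "products" record, a nested "common" record,
# and free-form extras stored as a JSON string in "meta".
event = {
    "order_id": 1001,
    "products": [
        {"product_id": 42, "attributes": ["red", "large"], "price": 19.99, "name": "T-shirt"},
    ],
    "common": {
        "insert_id": 1,
        "tenant": 3,
        "event": 12,
        "user_id": 777,
        "timestamp": "2015-10-01 12:00:00",
        "utm": {"source": "newsletter", "medium": "email"},
    },
    "meta": json.dumps({"coupon": "WELCOME"}),
}

REQUIRED_COMMON = ("insert_id", "tenant", "event", "user_id", "timestamp")


def missing_common_fields(row):
    """Return the REQUIRED common.* fields that are absent from the row."""
    common = row.get("common") or {}
    return [name for name in REQUIRED_COMMON if name not in common]


line = json.dumps(event)  # one JSON document per line
print(missing_common_fields(event), len(line) > 0)
```

Keeping rarely queried attributes in the `meta` string follows the earlier note: store what you can in columns, the rest in JSON.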
  18. 18. Streaming insert time (ms) - last 6M
  19. 19. Achievements ● Funnel Analysis
  20. 20. Attribute orders to first article visited Example: ● article1 -> page2 -> page3 -> page4 -> orderpage1 -> thankyoupage1 ● page1 -> article2 -> page3 -> orderpage2 -> ... Problem: When an order is made, attribute a credit to the first article visited by that user!
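The attribution rule can be sketched in plain Python before expressing it in SQL. This is not the query used in the project; it assumes events arrive in time order as `(user_id, page)` pairs and, purely for illustration, that article pages start with `article` and order pages with `orderpage`, as in the two example paths.

```python
from collections import Counter


def attribute_orders(events):
    """Credit each order to the first article its user visited.

    events: iterable of (user_id, page) pairs, already sorted by time.
    """
    first_article = {}
    credits = Counter()
    for user_id, page in events:
        if page.startswith("article"):
            # setdefault keeps only the first article seen per user.
            first_article.setdefault(user_id, page)
        elif page.startswith("orderpage") and user_id in first_article:
            credits[first_article[user_id]] += 1
    return credits


events = [
    ("u1", "article1"), ("u1", "page2"), ("u1", "page3"), ("u1", "page4"),
    ("u1", "orderpage1"), ("u1", "thankyoupage1"),
    ("u2", "page1"), ("u2", "article2"), ("u2", "page3"), ("u2", "orderpage2"),
]
print(dict(attribute_orders(events)))
```

Orders from users who never visited an article receive no credit in this sketch; whether to count them separately is a design decision the slide leaves open.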
  21. 21. Achievements ● Funnel Analysis ● Email URL click heatmap
  22. 22. Email URL clicks map (79GB in 2.4sec)
  23. 23. Achievements Continued ● Funnel Analysis ● Email URL click heatmap ● Email Dashboard (Trends, SPAM, ISP deferral) ● Split tests (by content, region, device, during the day) ● Ability for advanced segmentation as all raw data is stored ● Behavioral analytics (engaged users, recommendations)
  24. 24. Our benefits ● no provisioning/deploy ● no running out of resources ● no more focus on large-scale execution plans ● no need to re-implement tricky concepts (time windows / join streams) ● pay only for the columns we use in our queries ● run raw ad-hoc queries (by analysts, sales or Devs) ● no more throwing away, expiring, or aggregating old data.
  25. 25. BigQuery: Sample projects to try out 1. githubarchive.org: 20+ event types available since 2012 a. pull request latency b. expressions, emotions in commit messages 2. httparchive.org: Trends in web technology a. popular scripts b. website performance 3. raw Google Analytics data (*only Premium Customers) 4. GDELT - Global Database of Events, Language, and Tone GKG - Global Knowledge Graph 5. GSOD - samples of weather (rainfall, temp…) 6. 1.6 billion Reddit comments 7. Hackernews data 8. Wikipedia edits
  26. 26. HttpArchive - .HU Javascript frameworks
  27. 27. GDELT - News Coverage: Orbán Viktor
  28. 28. GDELT - News Coverage: Beata Szydlo
  29. 29. Reddit - books community talks about
  30. 30. Questions? Thank you.
