Google BigQuery for the Everyday Developer

IV. IT&C Innovation Conference - October 2016 - Sovata, Romania

A. Every scientist who needs big data analytics to save millions of lives should have that power
Legacy systems don’t provide the power.

B. The simple fact is that you are brilliant but your brilliant ideas require complex analytics.
Traditional solutions are not applicable.

The Plan: have oversight over developments as they happen.
Goal: Store everything accessible by SQL immediately.

What is BigQuery?
Analytics-as-a-Service - Data Warehouse in the Cloud
Fully-Managed by Google (US or EU zone)
Scales into Petabytes
Ridiculously fast
Decent pricing (queries $5/TB, storage: $20/TB) *October 2016 pricing
100,000 rows/sec Streaming API
Open Interfaces (Web UI, BQ command line tool, REST, ODBC)
Familiar DB Structure (table, views, record, nested, JSON)
Convenience of SQL + JavaScript UDF (User Defined Functions)
Integrates with Google Sheets + Google Cloud Storage + Pub/Sub connectors
Client libraries available in YFL (your favorite languages)
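
As a rough illustration of the pricing bullet above, the sketch below applies flat per-TB rates to a monthly workload. The query and storage rates are the October 2016 figures quoted in the list; the $50/TB streaming-insert rate is taken from the deck's cost slide. The function name `monthly_cost_usd` is hypothetical, and the model deliberately ignores the free query tier and the long-term storage discount, so treat it as a back-of-the-envelope estimate rather than a billing calculator.

```python
# Flat per-TB rates quoted in this deck (October 2016 pricing).
QUERY_USD_PER_TB = 5
STORAGE_USD_PER_TB = 20
STREAMING_USD_PER_TB = 50  # taken from the deck's cost slide


def monthly_cost_usd(storage_tb, streaming_tb, queries_tb):
    """Simplified monthly bill.

    Assumption: no free query tier and no long-term storage discount.
    """
    return (
        storage_tb * STORAGE_USD_PER_TB
        + streaming_tb * STREAMING_USD_PER_TB
        + queries_tb * QUERY_USD_PER_TB
    )


if __name__ == "__main__":
    # "Estimate 1" from the cost slide: 5 TB stored, 1 TB streamed, 3 TB queried.
    print(monthly_cost_usd(5, 1, 3))
```

With these flat rates, the deck's "Estimate 1" workload comes to $165, matching the slide. The same model gives $780 for "Estimate 2", slightly below the slide's $788; the slide does not say which billing details account for the difference.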

Our benefits
no provisioning/deploy
no running out of resources
no more focus on large-scale execution plans
no need to re-implement tricky concepts (time windows / join streams)
pay only for the columns used in our queries
run raw ad-hoc queries (either by analysts/sales or devs)
no more throwing away, expiring, or aggregating old data.
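
To make the "pay only for the columns" benefit concrete, here is a minimal sketch of column-based query billing. The column names and sizes in `COLUMN_BYTES` are made up for illustration, the decimal-terabyte conversion is an assumption, and `query_cost_usd` is a hypothetical helper. Only the $5/TB rate comes from the deck.

```python
QUERY_USD_PER_TB = 5      # October 2016 on-demand query rate quoted in this deck
BYTES_PER_TB = 10 ** 12   # decimal terabyte, an assumption for this sketch

# Hypothetical stored size of each column of one table, in bytes.
COLUMN_BYTES = {
    "user_id": 2 * 10 ** 11,
    "email": 8 * 10 ** 11,
    "payload": 9 * 10 ** 12,
}


def query_cost_usd(referenced_columns, column_bytes=COLUMN_BYTES):
    """Cost of one full scan that reads only the referenced columns."""
    scanned = sum(column_bytes[name] for name in referenced_columns)
    return scanned / BYTES_PER_TB * QUERY_USD_PER_TB


if __name__ == "__main__":
    print(query_cost_usd(["user_id", "email"]))  # two small columns
    print(query_cost_usd(list(COLUMN_BYTES)))    # every column in the table
```

In this toy table, a query touching only `user_id` and `email` scans 1 TB, while one touching every column scans 10 TB. This is why naming columns explicitly, rather than selecting everything, keeps query costs down in a columnar store.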

Published in: Software


  1. Google BigQuery for the Everyday Developer. Márton Kodok, Senior Software Engineer at REEA (twitter: martonkodok, stackoverflow: pentium10, github: @pentium10). IV. IT&C Innovation Conference - October 2016 - Sovata, Romania
  2. Agenda: 1. Big Data Challenges 2. Infrastructure overview 3. Choosing a Strategy 4. BigQuery - Use Cases 5. Q & A (slide footer: BigQuery for the Everyday Developer @martonkodok)
  3. The 5 V’s of Big Data
  4. Big Data movement: Every scientist who needs big data analytics to save millions of lives should have that power. Legacy systems don’t provide the power. The simple fact is that you are brilliant but your brilliant ideas require complex analytics. Traditional solutions are not applicable.
  5. Fast computing systems: With more and more people realizing the value of realtime analytics, the race to build realtime middleware has begun. The Plan: have oversight over developments as they happen. * More slides about Beanstalkd at slideshare.net/martonkodok
  6. Simple legacy Website stack/infrastructure
  7. Infrastructure with Middleware
  8. Requirements. Task: need a backend/database to STORE, QUERY, and EXTRACT deep analytics:
     ● Events (website events, app events, email events)
     ● Entities (user data, business entities, A/B split test data)
     ● Metrics (sales data, code commit history, profiler data)
     ● Streaming (sensor data, IoT data burst)
     ● 3rd party Analytics (Google Analytics, Mixpanel)
     ● Local generated data (log file entries, trace output, unstructured data...)
     ❏ Run ad-hoc queries (without a developer)
     ❏ Be realtime
  9. Desired system/platform
     ● Terabyte scalable storage
     ● Real-time row ingestion
     ● Ask sophisticated queries
     ● Query-performance
     ● Low-maintenance
     ● Cost effective
     ● Wire them up easily
     Goal: Store everything accessible by SQL immediately.
     Equipment strategy (In-House, Hosted, Managed) * people still required
     Engines:
     ❏ MongoDB, Riak, Redis
     ❏ ELK Stack (Elastic-Logstash-Kibana)...
     ❏ Cassandra, Hive, Hadoop...
     ❏ Amazon RedShift, Google BigQuery...
  10. Google BigQuery
  11. What is BigQuery?
      ● Analytics-as-a-Service - Data Warehouse in the Cloud
      ● Fully-Managed by Google (US or EU zone)
      ● Scales into Petabytes
      ● Ridiculously fast
      ● Decent pricing (queries $5/TB, storage: $20/TB) *October 2016 pricing
      ● 100,000 rows/sec Streaming API
      ● Open Interfaces (Web UI, BQ command line tool, REST, ODBC)
      ● Familiar DB Structure (table, views, record, nested, JSON)
      ● Convenience of SQL + JavaScript UDF (User Defined Functions)
      ● Integrates with Google Sheets + Google Cloud Storage + Pub/Sub connectors
      ● Client libraries available in YFL (your favorite languages)
  12. BigQuery: Convenience of SQL
      ● ACL - row level locking (individual or group based)
      ● Columnar storage (max 10,000 columns per table)
      ● Rich SQL: JSON, IP, Math, RegExp, Window functions
      ● Data types: String, Integer, Float, Boolean, Timestamp, Record, Nested, Struct, Array.
      ● Append-only tables preferred (DML syntax available)
      ● Batch load file size limit: 1 TB (CSV or JSON)
      ● User Defined Functions in SQL or JavaScript
      ● Date partitioned tables
  13. BigQuery Costs - October 2016
      Queries:
      ➔ 1 TB per month free
      ➔ 5 USD per TB
      ➔ only pay for the columns you use in your query
      Storage:
      ➔ 20 USD per TB frequently accessed data
      ➔ 10 USD per TB long term storage (90 days)
      Ingestion:
      ➔ Batch load free (CSV/JSON)
      ➔ Exporting free
      ➔ Table copy free
      ➔ 50 USD per TB (streaming inserts)
      Estimate 1: Storage 5 TB, Streaming Inserts 1 TB, Queries 3 TB. Monthly total: $165
      Estimate 2: Storage 24 TB, Streaming Inserts 1 TB, Queries 50 TB. Monthly total: $788
      * 1 Petabyte storage, 10 TB inserts, 100 TB queries => $22000
  14. Schema modelling / JSON
      +--------------------------+-----------+----------+
      | order_id                 | INTEGER   | REQUIRED |
      | ...                      |           |          |
      | products                 | RECORD    | REPEATED |
      | products.name            | STRING    | NULLABLE |
      | products.product_id      | INTEGER   | NULLABLE |
      | products.attributes      | STRING    | REPEATED |
      | products.price           | FLOAT     | NULLABLE |
      | ...                      |           |          |
      | common                   | RECORD    | NULLABLE |
      | common.insert_id         | INTEGER   | REQUIRED |
      | common.event             | INTEGER   | REQUIRED |
      | common.user_id           | INTEGER   | REQUIRED |
      | common.timestamp         | TIMESTAMP | REQUIRED |
      | common.utm               | RECORD    | NULLABLE |
      | common.utm.source        | STRING    | NULLABLE |
      | common.utm.medium        | STRING    | NULLABLE |
      | common.utm.campaign      | STRING    | NULLABLE |
      | common.utm.content       | STRING    | NULLABLE |
      | common.utm.term          | STRING    | NULLABLE |
      | meta                     | STRING    | NULLABLE |
      +--------------------------+-----------+----------+
      Example: "products":[p1,p2]
      "products":[{"name":"p1", "product_id":10, "attributes":["red","xl"], "price":9.99},
                  {"name":"p2", "product_id":20, "attributes":["red","xl"], "price":9.99}]
  15. Append only tables - Get last value
      1. Use aggregation MIN/MAX on timestamp to find first/last and join back to the same table.
      2. Use analytic functions FIRST_VALUE and LAST_VALUE when only one column is needed:
         SELECT LAST_VALUE(email) OVER (PARTITION BY user_id ORDER BY timestamp ASC) AS email_last ...
      3. Use window functions for full row selection:
         SELECT email, firstname, lastname
         FROM (SELECT email, firstname, lastname,
                      ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY timestamp DESC) seqnum
               FROM [profile_event])
         WHERE seqnum=1
  16. Streaming insert time (ms) - last year
  17. Achievements
      ● Funnel Analysis
  18. Funnel analysis: Time on upsell pages
  19. Attribute credit to first article visited on purchase. Example HITS chain:
      ● article1 -> page2 -> page3 -> page4 -> orderpage1 -> thankyoupage1
      ● page1 -> article2 -> page3 -> orderpage2 -> ...
  20. Achievements
      ● Funnel Analysis
      ● Email URL click heatmap
  21. Email URL clicks heat-map
  22. Achievements Continued
      ● Funnel Analysis
      ● Email URL click heatmap
      ● Email Health Dashboard (SPAM, ISP deferral, content A/B split tests, trends or low open rate campaigns)
      ● Advanced segmentation (all raw data stored)
      ● Behavioral analytics - engaged users etc...
  23. Our benefits
      ● no provisioning/deploy
      ● no running out of resources
      ● no more focus on large-scale execution plans
      ● no need to re-implement tricky concepts (time windows / join streams)
      ● pay only for the columns used in our queries
      ● run raw ad-hoc queries (either by analysts/sales or devs)
      ● no more throwing away, expiring, or aggregating old data.
  24. BigQuery: Sample projects to try out
      1. githubarchive.org: 20+ event types available since 2012
         a. pull request latency
         b. expressions, emotions in commit messages
         c. etc...
      2. httparchive.org
         a. sites that use multiple popular JS frameworks
         b. sites that use multiple jQuery versions
      3. Export for Google Analytics
      4. MLab - broadband connection performance (26B rows)
      5. GSOD - samples of weather (rainfall, temp…)
      6. US Natality - 1969 to 2008
      7. Wikipedia edits
  25. HttpArchive - multiple JS frameworks
  26. HttpArchive - multiple jQuery versions
  27. Questions? Thank you. Slides available soon on: slideshare.net/martonkodok
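
The first "get last value" pattern from the append-only tables slide (aggregate MAX on the timestamp, then join back to the same table) can be tried locally. The sketch below uses Python's built-in `sqlite3` as a stand-in for BigQuery, so the SQL dialect differs from the legacy SQL shown in the deck. The `profile_event` table name follows the slide, while the `ts` column, the sample rows, and the `latest_profiles` helper are assumptions made for illustration.

```python
import sqlite3


def latest_profiles(rows):
    """Return (user_id, email) for each user's most recent event.

    rows: iterable of (user_id, email, ts) tuples appended over time.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE profile_event (user_id INTEGER, email TEXT, ts INTEGER)"
    )
    conn.executemany("INSERT INTO profile_event VALUES (?, ?, ?)", rows)
    # Find the newest timestamp per user, then join back to fetch the full row.
    query = """
        SELECT e.user_id, e.email
        FROM profile_event AS e
        JOIN (
            SELECT user_id, MAX(ts) AS last_ts
            FROM profile_event
            GROUP BY user_id
        ) AS latest
          ON e.user_id = latest.user_id AND e.ts = latest.last_ts
        ORDER BY e.user_id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    events = [
        (1, "old@example.com", 100),
        (1, "new@example.com", 200),
        (2, "solo@example.com", 150),
    ]
    print(latest_profiles(events))
```

One caveat of the join-back approach: if a user has two events with the same maximum timestamp, both rows are returned. The ROW_NUMBER variant on the slide avoids this by keeping exactly one row per user.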
