Big Data with BigQuery, presented at DevoxxUK 2014 by Javier Ramirez from teowaki

Big data is amazing. You can get insights from your users, find interesting patterns, and have lots of geek fun. The problem is that big data usually means many servers, a complex setup, intensive monitoring, and a steep learning curve. All of those things cost money. If you don’t have the money, you miss out on all the fun.

In this talk I show how you can use Google BigQuery, a hosted solution, to manage big data from your application. You can start for less than $1 per month.

Published in: Software


  1. 1. @supercoco9#devoxxBigQuery Big Data with Google BigQuery Javier Ramirez @supercoco9 https://teowaki.com
  2. 2. @supercoco9#DevoxxBigquery Managing Big Data with BigQuery Javier Ramirez •Writing software since 1996 •Web dev. since 1999 (C++, JAVA, PHP, Ruby, JS...) •Founder of https://teowaki.com •Google Developer Expert on the Cloud Platform
  3. 3. @supercoco9#DevoxxBigquery BIG DATA
  4. 4. @supercoco9#DevoxxBigquery BIG SERVERS
  5. 5. @supercoco9#DevoxxBigquery BIG DEVOPS
  6. 6. @supercoco9#DevoxxBigquery BIG MONEY
  7. 7. Big data is cool, but... •hard to set up and monitor •expensive cluster •not interactive enough
  8. 8. @supercoco9#DevoxxBigquery Big data is doing a full scan of 330MM rows, matching them against a regexp, and getting the result (223MM rows) in just 5 seconds
  9. 9. Google BigQuery Data analysis as a service http://developers.google.com/bigquery
  10. 10. Based on “Dremel” Specifically designed for interactive queries over petabytes of real-time data
  11. 11. @supercoco9#DevoxxBigquery Your only worries •Load data •Query the dataset
  12. 12. loading data. You just send the data in text (or JSON) format
  13. 13. up to 100K inserts per second in stream mode
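
As a rough illustration of the "send the data as JSON" idea from slide 12, the sketch below serializes a few rows as newline-delimited JSON, one object per line, which is the JSON layout BigQuery load jobs accept. The row fields (`user`, `path`, `ts`) and the helper name `to_ndjson` are invented for this example, and no BigQuery API is called.

```python
import json


def to_ndjson(rows):
    """Serialize an iterable of dicts as newline-delimited JSON (one object per line)."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)


# Hypothetical web-log rows; the field names are assumptions, not from the talk.
rows = [
    {"user": "alice", "path": "/home", "ts": "2014-06-12 10:00:00"},
    {"user": "bob", "path": "/links", "ts": "2014-06-12 10:00:05"},
]

payload = to_ndjson(rows)
print(payload)
```

Each line stays a complete JSON document, so a file in this layout can be split and loaded in parallel without parsing the whole payload first.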
  14. 14. It's just SQL
      select name from USERS order by date;
      select count(*) from users;
      select max(date) from USERS;
      select sum(total) from ORDERS group by user;
  15. 15. @supercoco9#DevoxxBigquery Subselect and joins out of the box
      SELECT Year, Actor1Name, Actor2Name, Count FROM (
        SELECT Actor1Name, Actor2Name, Year, COUNT(*) Count,
               RANK() OVER(PARTITION BY YEAR ORDER BY Count DESC) rank
        FROM
          (SELECT Actor1Name, Actor2Name, Year
           FROM [gdelt-bq:full.events]
           WHERE Actor1Name < Actor2Name
             and Actor1CountryCode != '' and Actor2CountryCode != ''
             and Actor1CountryCode!=Actor2CountryCode),
          (SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year
           FROM [gdelt-bq:full.events]
           WHERE Actor1Name > Actor2Name
             and Actor1CountryCode != '' and Actor2CountryCode != ''
             and Actor1CountryCode!=Actor2CountryCode),
        WHERE Actor1Name IS NOT null AND Actor2Name IS NOT null
        GROUP EACH BY 1, 2, 3
        HAVING Count > 100
      )
      WHERE rank=1
      ORDER BY Year
      http://gdeltproject.org/data.html#googlebigquery
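
To make the logic of the slide 15 query easier to follow, here is a small in-memory sketch of the same steps in Python: normalize each actor pair so that (A, B) and (B, A) are counted together, count pairs per year, keep counts above a threshold, and pick the top pair for each year. The function name `top_pair_per_year`, the tuple layout, and the sample events are assumptions made for illustration. Unlike `RANK()`, which can return several rows on a tie, this sketch keeps only the first top pair it sees.

```python
from collections import Counter


def top_pair_per_year(events, min_count=100):
    """Return {year: ((actor_a, actor_b), count)} for the most frequent pair each year.

    events: iterable of (year, actor1, actor2, country1, country2) tuples.
    """
    counts = Counter()
    for year, a1, a2, c1, c2 in events:
        # Same filters as the WHERE clauses: names and country codes set,
        # and the two country codes different.
        if not a1 or not a2 or not c1 or not c2 or c1 == c2:
            continue
        if a1 == a2:
            continue  # the query's strict < and > comparisons drop equal names
        # Order the names so (A, B) and (B, A) fall into the same group.
        pair = (a1, a2) if a1 < a2 else (a2, a1)
        counts[(year, pair)] += 1

    best = {}
    for (year, pair), n in counts.items():
        if n <= min_count:  # HAVING Count > min_count
            continue
        if year not in best or n > best[year][1]:  # rank = 1 within the year
            best[year] = (pair, n)
    return best


# Made-up sample events, with a low threshold so the tiny data set survives.
events = [
    (1979, "USA", "RUSSIA", "US", "RS"),
    (1979, "RUSSIA", "USA", "RS", "US"),
    (1979, "USA", "CHINA", "US", "CH"),
    (1980, "CHINA", "JAPAN", "CH", "JA"),
    (1980, "JAPAN", "CHINA", "JA", "CH"),
    (1980, "USA", "USA", "US", "US"),
]
result = top_pair_per_year(events, min_count=1)
print(result)
```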
  16. 16. @supercoco9#DevoxxBigquery Specific extensions for analytics: within, flatten, nest, stddev, top, first, last, nth, variance, var_pop, var_samp, covar_pop, covar_samp, quantiles, correlations
  17. 17. Things you always wanted to try but were too scared to
      select count(*) from publicdata:samples.wikipedia
      where REGEXP_MATCH(title, "[0-9]*") AND wp_namespace = 0;
      223,163,387
      Query complete (5.6s elapsed, 9.13 GB processed, Cost: 32¢)
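
One detail of the pattern in slide 17 is worth a small experiment. The quantifier `*` allows zero digits, so a "match anywhere" test succeeds on any string, while `+` requires at least one digit. The sketch below shows this with Python's `re.search`; that BigQuery's `REGEXP_MATCH` uses the same partial-match behaviour is an assumption here, and the sample titles and the helper `count_matches` are invented for the example.

```python
import re

# Made-up sample titles, including an empty one.
titles = ["Python", "1984", "Route 66", ""]


def count_matches(pattern, values):
    """Count the values in which the pattern matches anywhere in the string."""
    regex = re.compile(pattern)
    return sum(1 for value in values if regex.search(value) is not None)


star = count_matches(r"[0-9]*", titles)  # zero digits still counts as a match
plus = count_matches(r"[0-9]+", titles)  # needs at least one digit
print(star, plus)  # prints "4 2"
```

If the assumption holds, the query still forces a regexp evaluation on every row, which is the point of the demo, but `[0-9]+` would be the pattern to use for "titles that contain a number".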
  18. 18. columnar storage https://cookbook.experiencesaphana.com/crm/what-is-crm-on-hana/technology-innovation/row-vs-column-based/
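
A toy sketch of what columnar storage means in practice: the same records are stored as one list per field instead of one record per row, so a query that needs a single field only touches that field's list. The helper `to_columns` and the sample rows are illustrative assumptions (only `title` and `wp_namespace` appear in the talk), and real columnar engines add compression and on-disk encoding that this sketch ignores.

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


# Row-oriented layout: every record carries all of its fields.
rows = [
    {"title": "Python", "wp_namespace": 0, "length": 52000},
    {"title": "Talk:Python", "wp_namespace": 1, "length": 1800},
    {"title": "1984", "wp_namespace": 0, "length": 41000},
]

columns = to_columns(rows)

# Counting articles in namespace 0 reads one column and never touches
# the titles or lengths.
namespace_zero = sum(1 for ns in columns["wp_namespace"] if ns == 0)
print(namespace_zero)  # prints 2
```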
  19. 19. highly distributed execution using a tree
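
The tree idea can be sketched in a few lines: leaf workers each compute a partial result over their own shard of the data, and intermediate levels combine those partial results until one value reaches the root. The function names, the `fanout` parameter, and the sample shards are assumptions for illustration. The sketch runs sequentially, whereas a real system would run the leaves in parallel on many machines.

```python
def leaf_count(shard, predicate):
    """Partial result computed by one leaf: matching rows in its shard."""
    return sum(1 for row in shard if predicate(row))


def tree_count(shards, predicate, fanout=2):
    """Combine per-shard counts level by level, as a tree of mixers would."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    partials = [leaf_count(shard, predicate) for shard in shards]
    while len(partials) > 1:
        # Each parent node sums the partial counts of up to `fanout` children.
        partials = [
            sum(partials[i:i + fanout])
            for i in range(0, len(partials), fanout)
        ]
    return partials[0] if partials else 0


# Made-up data split across five shards.
shards = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10], [11]]
total_even = tree_count(shards, lambda n: n % 2 == 0)
print(total_even)  # prints 5
```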
  20. 20. web console screenshot
  21. 21. @supercoco9#DevoxxBigquery country segmented traffic
  22. 22. @supercoco9#DevoxxBigquery window functions
  23. 23. @supercoco9#DevoxxBigquery our most active user
  24. 24. @supercoco9#DevoxxBigquery Worldwide events in the last 36 years
      SELECT Year, Actor1Name, Actor2Name, Count FROM (
        SELECT Actor1Name, Actor2Name, Year, COUNT(*) Count,
               RANK() OVER(PARTITION BY YEAR ORDER BY Count DESC) rank
        FROM
          (SELECT Actor1Name, Actor2Name, Year
           FROM [gdelt-bq:full.events]
           WHERE Actor1Name < Actor2Name
             and Actor1CountryCode != '' and Actor2CountryCode != ''
             and Actor1CountryCode!=Actor2CountryCode),
          (SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year
           FROM [gdelt-bq:full.events]
           WHERE Actor1Name > Actor2Name
             and Actor1CountryCode != '' and Actor2CountryCode != ''
             and Actor1CountryCode!=Actor2CountryCode),
        WHERE Actor1Name IS NOT null AND Actor2Name IS NOT null
        GROUP EACH BY 1, 2, 3
        HAVING Count > 100
      )
      WHERE rank=1
      ORDER BY Year
      http://gdeltproject.org/data.html#googlebigquery
  25. 25. SELECT repository_name, repository_language, repository_description,
             COUNT(repository_name) as cnt, repository_url
      FROM github.timeline
      WHERE type="WatchEvent"
        AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC("#{yesterday} 20:00:00")
        AND repository_url IN (
          SELECT repository_url
          FROM github.timeline
          WHERE type="CreateEvent"
            AND PARSE_UTC_USEC(repository_created_at) >= PARSE_UTC_USEC('#{yesterday} 20:00:00')
            AND repository_fork = "false"
            AND payload_ref_type = "repository"
          GROUP BY repository_url
        )
      GROUP BY repository_name, repository_language, repository_description, repository_url
      HAVING cnt >= 5
      ORDER BY cnt DESC
      LIMIT 25
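
The `#{yesterday}` placeholders in the slide 25 query are string interpolation filled in by the calling program. A minimal Python sketch of building that cutoff value follows. It assumes the placeholder holds a `YYYY-MM-DD` date, which is a guess based on the `20:00:00` suffix, and the helper name `cutoff_for` is hypothetical.

```python
from datetime import date, timedelta


def cutoff_for(today):
    """Return the 'yesterday at 20:00:00' timestamp string for a given date."""
    yesterday = today - timedelta(days=1)
    # isoformat() on a date yields YYYY-MM-DD (assumed placeholder format).
    return f"{yesterday.isoformat()} 20:00:00"


cutoff = cutoff_for(date(2014, 6, 13))
clause = f'PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC("{cutoff}")'
print(clause)
```

Passing the date in as an argument, rather than calling `date.today()` inside the helper, keeps the function easy to test around month and year boundaries.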
  26. 26. @supercoco9#DevoxxBigquery
  27. 27. @supercoco9#DevoxxBigquery Automation with Apps Script ● Read from BigQuery ● Create a spreadsheet on Drive ● E-mail it every day as a PDF https://developers.google.com/apps-script/
  28. 28. @supercoco9#DevoxxBigquery BigQuery pricing •$26 per stored TB: 1,000,000 rows => $0.00416 / month (£0.00243 / month) •$5 per processed TB: 1 full scan = 160 MB, 1 count = 0 MB, 1 full scan over 1 column = 5.4 MB, 100 GB => $0.05 / month (£0.03) •Apps Script is free
  29. 29. @supercoco9#DevoxxBigquery price per month: £0.054307 / month* per 1MM rows** *the 1st 1TB every month is free of charge **assuming your rows have web-server-log-like info
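
The storage figure on slide 28 can be reproduced with simple arithmetic. The sketch below uses the two rates quoted in the talk ($26 per stored TB per month, $5 per processed TB) and the slide's sizes (160 MB for one million rows, 5.4 MB for a single column). Treating 1 TB as 1,000,000 MB is an assumption, the function names are invented, and the free monthly tier is not modelled.

```python
STORAGE_USD_PER_TB_MONTH = 26.0  # rate quoted on slide 28
QUERY_USD_PER_TB = 5.0           # rate quoted on slide 28
MB_PER_TB = 1_000_000            # decimal units, an assumption


def storage_cost_usd(size_mb):
    """Monthly storage cost in USD for a table of the given size."""
    return size_mb / MB_PER_TB * STORAGE_USD_PER_TB_MONTH


def query_cost_usd(processed_mb):
    """Cost in USD of one query that processes the given amount of data."""
    return processed_mb / MB_PER_TB * QUERY_USD_PER_TB


storage = storage_cost_usd(160)    # one million rows, per the slide
full_scan = query_cost_usd(160)    # scanning every column once
one_column = query_cost_usd(5.4)   # scanning a single column once

print(f"storage per month: ${storage:.5f}")
print(f"full scan:         ${full_scan:.6f}")
print(f"one-column scan:   ${one_column:.6f}")
```

The gap between the full scan and the one-column scan is the practical payoff of columnar storage: selecting only the fields you need reduces the bytes processed, and therefore the bill.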
  30. 30. @supercoco9#DevoxxBigquery
  31. 31. @supercoco9#devoxxBigquery THANKS! Javier Ramirez @supercoco9 https://teowaki.com Related links at: https://teowaki.com/teams/javier-community/link-categories/bigquery-talk
  32. 32. @supercoco9#DevoxxBigquery Thanks / Creative Commons •Presentation Template — Guillaume LaForge •The Queen — A prestigious heritage with some inspiration from The Sex Pistols and funny Devoxxians •Girl with a Balloon — Banksy •Tube — Michael Keen
