Your SlideShare is downloading. ×
API analytics with Bigquery, by Javier Ramirez from teowaki
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

API analytics with Bigquery, by Javier Ramirez from teowaki

337
views

Published on

At https://teowaki.com we have a system for API usage analytics, with Redis as a fast intermediate store and google Bigquery as a big data backend. As a result, we can launch aggregated queries on our …

At https://teowaki.com we have a system for API usage analytics, with Redis as a fast intermediate store and google Bigquery as a big data backend. As a result, we can launch aggregated queries on our traffic/usage data in just a few seconds and we can try and find for usage patterns that wouldn’t be obvious otherwise.

In this session I will talk about how we entered the Big Data world, which alternatives we evaluated, and how we are using Redis and Bigquery to solve our problem.

Published in: Technology, Business

0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
337
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
3
Comments
0
Likes
1
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. javier ramirez @supercoco9 API Analytics with Redis, BigQuery, and AppsScripts
  • 2. a two people start-up
  • 3. a different league...
  • 4. .. or maybe not
  • 5. moral of the story you can do big, if you know how
  • 6. Set a distance. Set an expiration time. Bye bye noise.
  • 7. javier ramirez @supercoco9 https://teowaki.com REST API (Ruby on Rails) + Web on top (AngularJS)
  • 8. javier ramirez @supercoco9 https://teowaki.com
  • 9. data that’s an order of magnitude greater than data you’re accustomed to javier ramirez @supercoco9 https://teowaki.com Doug Laney VP Research, Business Analytics and Performance Management at Gartner
  • 10. data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures. Ed Dumbill program chair for the O’Reilly Strata Conference javier ramirez @supercoco9 https://teowaki.com
  • 11. bigdata is doing a fullscan to 330MM rows, matching them against a regexp, and getting the result (223MM rows) in just 5 seconds javier ramirez @supercoco9 https://teowaki.com Javier Ramirez impresionable teowaki founder
  • 12. 1. non intrusive metrics 2. keep the history 3. avoid vendor lock-in 4. interactive queries 5. cheap 6. extra ball: real time javier ramirez @supercoco9 https://teowaki.com
  • 13. twitter stackoverflow pinterest booking.com World of Warcraft YouPorn HipChat Snapchat javier ramirez @supercoco9 https://teowaki.com ntopng LogStash
  • 14. javier ramirez @supercoco9 https://teowaki.com Non intrusive metrics Capture data really fast. Then process the data on the background
  • 15. javier ramirez @supercoco9 https://teowaki.com
  • 16. javier ramirez @supercoco9 https://teowaki.com
  • 17. Gzip to AWS S3/Glacier or Google Cloud Storage javier ramirez @supercoco9 https://teowaki.com
  • 18. javier ramirez @supercoco9 https://teowaki.com
  • 19. Hadoop Cassandra Hadoop + Voldemort + Kafka HBase … Amazon Redshift javier ramirez @supercoco9 https://teowaki.com tools we considered:
  • 20. but... hard to set up and monitor not interactive enough expensive cluster javier ramirez @supercoco9 https://teowaki.com
  • 21. Our choice: Google BigQuery Data analysis as a service http://developers.google.com/bigquery javier ramirez @supercoco9 https://teowaki.com
  • 22. Based on “Dremel” Specifically designed for interactive queries over petabytes of real-time data javier ramirez @supercoco9 https://teowaki.com
  • 23. loading data You just send the data in text (or JSON) format javier ramirez @supercoco9 https://teowaki.com
  • 24. SQL javier ramirez @supercoco9 https://teowaki.com select name from USERS order by date; select count(*) from users; select max(date) from USERS; select sum(total) from ORDERS group by user;
  • 25. specific extensions for analytics javier ramirez @supercoco9 https://teowaki.com within flatten nest stddev top first last nth variance var_pop var_samp covar_pop covar_samp quantiles correlations
  • 26. Things you always wanted to try but were too scared to javier ramirez @supercoco9 https://teowaki.com select count(*) from publicdata:samples.wikipedia where REGEXP_MATCH(title, "[0-9]*") AND wp_namespace = 0; 223,163,387 Query complete (5.6s elapsed, 9.13 GB processed, Cost: 32¢)
  • 27. columnar storage javier ramirez @supercoco9 https://teowaki.com
  • 28. highly distributed execution using a tree javier ramirez @supercoco9 https://teowaki.com
  • 29. web console screenshot javier ramirez @supercoco9 https://teowaki.com
  • 30. javier ramirez @supercoco9 https://teowaki.com country segmented traffic
  • 31. javier ramirez @supercoco9 https://teowaki.com window functions
  • 32. javier ramirez @supercoco9 https://teowaki.com our most active user
  • 33. new users per month
  • 34. javier ramirez @supercoco9 https://teowaki.com 10 request we should be caching
  • 35. javier ramirez @supercoco9 http://teowaki.com 5 most created resources select uri, count(*) total from stats where method = 'POST' group by URI;
  • 36. javier ramirez @supercoco9 http://teowaki.com ...but /users/javier/shouts /users/rgo/shouts /teams/javier-community/links /teams/nosqlmatters-cgn/links
  • 37. javier ramirez @supercoco9 http://teowaki.com 5 most created resources
  • 38. SELECT repository_name, repository_language, repository_description, COUNT(repository_name) as cnt, repository_url FROM github.timeline WHERE type="WatchEvent" AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC("#{yesterday} 20:00:00") AND repository_url IN ( SELECT repository_url FROM github.timeline WHERE type="CreateEvent" AND PARSE_UTC_USEC(repository_created_at) >= PARSE_UTC_USEC('#{yesterday} 20:00:00') AND repository_fork = "false" AND payload_ref_type = "repository" GROUP BY repository_url ) GROUP BY repository_name, repository_language, repository_description, repository_url HAVING cnt >= 5 ORDER BY cnt DESC LIMIT 25
  • 39. NO
  • 40. Automation with Apps Script Read from bigquery Create a spreadsheet on Drive E-mail it everyday as a PDF javier ramirez @supercoco9 https://teowaki.com
  • 41. bigquery pricing $26 per stored TB 1000000 rows => $0.00416 / month £0.00243 / month $5 per processed TB 1 full scan = 160 MB 1 count = 0 MB 1 full scan over 1 column = 5.4 MB 100 GB => $0.05 / month £0.03javier ramirez @supercoco9 https://teowaki.com
  • 42. £0.054307 / month* per 1MM rows *the 1st 1TB every month is free of charge javier ramirez @supercoco9 https://teowaki.com
  • 43. 1. non intrusive metrics 2. keep the history 3. avoid vendor lock-in 4. interactive queries 5. cheap 6. extra ball: real time javier ramirez @supercoco9 https://teowaki.com
  • 44. ig
  • 45. Find related links at https://teowaki.com/teams/javier-community/link-categories/bigquery-talk Thanks! ‫תודה‬ Javier Ramírez @supercoco9