API analytics using Redis, BigQuery and Apps Script, by Javier Ramirez from teowaki (Apidays Mediterranea)

This is the story of how we implemented big data analytics for teowaki's API (https://teowaki.com).

Big data is amazing. You can get insights from your users, find interesting patterns and have lots of geek fun. The problem is that big data usually means many servers, a complex setup, intensive monitoring and a steep learning curve. All those things cost money. If you don't have the money, you miss out on all the fun. In my talk I will show you how you can use Redis, Google BigQuery and Apps Script to manage big data from your application for under $1 per month. Don't you feel like running a RegExp over 300 million rows in just 5 seconds?

Published in: Software

  1. API analytics with Redis and BigQuery. javier ramirez @supercoco9
  2. REST API (Ruby on Rails) + Web on top (AngularJS). javier ramirez @supercoco9 https://teowaki.com api days 14
  3. Use a hosted solution
  4. questions?
  6. "data that’s an order of magnitude greater than data you’re accustomed to" (Doug Laney, VP Research, Business Analytics and Performance Management at Gartner)
  7. "data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures." (Edd Dumbill, program chair for the O’Reilly Strata Conference)
  8. "bigdata is doing a full scan over 330MM rows, matching them against a regexp, and getting the result (223MM rows) in just 5 seconds" (Javier Ramirez, impressionable teowaki founder)
  9. (1) non-intrusive metrics, (2) keep the history, (3) avoid vendor lock-in, (4) interactive queries, (5) cheap, (6) extra ball: real time
  11. Redis (http://redis.io): open source, BSD licensed, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets, sorted sets and hyperloglogs. Started in 2009 by Salvatore Sanfilippo (@antirez); 112 contributors at https://github.com/antirez/redis
  12. twitter, stackoverflow, pinterest, booking.com, World of Warcraft, YouPorn, HipChat, Snapchat, ntopng, LogStash
  13. Intel(R) Xeon(R) CPU E5520 @ 2.27GHz (with pipelining)
      $ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -P 16 -q
      SET: 552,028 requests per second
      GET: 707,463 requests per second
      LPUSH: 767,459 requests per second
      LPOP: 770,119 requests per second
      Intel(R) Xeon(R) CPU E5520 @ 2.27GHz (without pipelining)
      $ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -q
      SET: 122,556 requests per second
      GET: 123,601 requests per second
      LPUSH: 136,752 requests per second
      LPOP: 132,424 requests per second
  14. Non-intrusive metrics: capture data really fast, then send the data in the background
  16. Redis keeps everything in memory all the time
  18. Gzip to AWS S3/Glacier or Google Cloud Storage
  20. tools we considered: Hadoop, Cassandra, Amazon Redshift, ...
  21. but... hard to set up and monitor; expensive cluster; not interactive enough
  22. Our choice: Google BigQuery. Data analysis as a service. http://developers.google.com/bigquery
  23. Based on “Dremel”. Specifically designed for interactive queries over petabytes of real-time data
  24. loading data: you just send the data in text (or JSON) format
  25. SQL
      select name from USERS order by date;
      select count(*) from users;
      select max(date) from USERS;
      select sum(total) from ORDERS group by user;
  26. specific extensions for analytics: within, flatten, nest, stddev, top, first, last, nth, variance, var_pop, var_samp, covar_pop, covar_samp, quantiles
  27. web console screenshot
  28. window functions
  29. our most active user
  30. country segmented traffic
  31. 10 requests we should be caching
  32. correlations, not to be mistaken for causality
  33. Things you always wanted to try but were too scared to
      select count(*) from publicdata:samples.wikipedia where REGEXP_MATCH(title, "[0-9]*") AND wp_namespace = 0;
      223,163,387
      Query complete (5.6s elapsed, 9.13 GB processed, Cost: 32¢)
  34. 5 most created resources
      select uri, count(*) total from stats where method = 'POST' group by URI;
  35. ...but
      /users/javier/shouts
      /users/rgo/shouts
      /teams/javier-community/links
      /teams/nosqlmatters-cgn/links
  36. 5 most created resources
  37. new users per month
  38. SELECT repository_name, repository_language, repository_description,
             COUNT(repository_name) as cnt, repository_url
      FROM github.timeline
      WHERE type="WatchEvent"
        AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC("#{yesterday} 20:00:00")
        AND repository_url IN (
          SELECT repository_url
          FROM github.timeline
          WHERE type="CreateEvent"
            AND PARSE_UTC_USEC(repository_created_at) >= PARSE_UTC_USEC('#{yesterday} 20:00:00')
            AND repository_fork = "false"
            AND payload_ref_type = "repository"
          GROUP BY repository_url
        )
      GROUP BY repository_name, repository_language, repository_description, repository_url
      HAVING cnt >= 5
      ORDER BY cnt DESC
      LIMIT 25
  39. Automation with Apps Script: read from BigQuery, create a spreadsheet on Drive, e-mail it every day as a PDF
  40. bigquery pricing
      $26 per stored TB: 1000000 rows => $0.00416 / month (£0.00243 / month)
      $5 per processed TB: 1 full scan = 160 MB; 1 count = 0 MB; 1 full scan over 1 column = 5.4 MB; 100 GB => $0.05 / month (£0.03)
  41. £0.054307 / month* per 1MM rows (*the 1st 1TB every month is free of charge)
  42. (1) non-intrusive metrics, (2) keep the history, (3) avoid vendor lock-in, (4) interactive queries, (5) cheap, (6) extra ball: real time
  43. Find related links at https://teowaki.com/teams/javier-community/link-categories/bigquery-talk Thanks! Gràcies! Javier Ramírez @supercoco9
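
Slides 14 to 24 describe capturing metrics quickly, shipping them in the background, gzipping the archive and loading JSON into BigQuery. The Python sketch below illustrates that flow under stated assumptions; it is not the talk's code. An in-process `deque` stands in for a Redis list (where `LPUSH` and `RPOP` would play the roles of `record` and `drain`), and the names `MetricsBuffer`, `record`, `drain`, `to_gzipped_ndjson` and the event fields are all hypothetical.

```python
import gzip
import json
from collections import deque


class MetricsBuffer:
    """In-memory stand-in for a Redis list used as a metrics queue (assumption)."""

    def __init__(self):
        self._events = deque()

    def record(self, method, uri, status):
        # Cheap O(1) append on the request path, analogous to LPUSH.
        self._events.appendleft({"method": method, "uri": uri, "status": status})

    def drain(self, batch_size):
        # Background job: take the oldest events first, analogous to RPOP.
        batch = []
        while self._events and len(batch) < batch_size:
            batch.append(self._events.pop())
        return batch


def to_gzipped_ndjson(events):
    # One JSON object per line, then gzip the whole batch for archiving.
    body = "\n".join(json.dumps(event, sort_keys=True) for event in events)
    return gzip.compress(body.encode("utf-8"))


buffer = MetricsBuffer()
buffer.record("POST", "/users/javier/shouts", 201)
buffer.record("GET", "/teams/javier-community/links", 200)

payload = to_gzipped_ndjson(buffer.drain(batch_size=100))
lines = gzip.decompress(payload).decode("utf-8").splitlines()
```

The request path only pays for an append; serialization and compression happen later in the batch job, which is the point of slide 14.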
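
Slides 34 and 35 point out that grouping by raw URI splits one kind of resource across many rows, because each path embeds a user or team name. The transcript does not show how the talk solved this (slide 36 is an image), so the following Python sketch is only one possible approach: collapse the variable path segment into a placeholder before counting. The `:id` placeholder, the `normalize_uri` helper and the assumption that the identifier is always the second path segment are illustrative.

```python
import re
from collections import Counter

# Assumption: URIs look like /<collection>/<identifier>/<subresource>.
_ID_SEGMENT = re.compile(r"^/([^/]+)/[^/]+/")


def normalize_uri(uri):
    # Replace the identifier segment with a fixed placeholder.
    return _ID_SEGMENT.sub(r"/\1/:id/", uri)


posted = [
    "/users/javier/shouts",
    "/users/rgo/shouts",
    "/teams/javier-community/links",
    "/teams/nosqlmatters-cgn/links",
]

# Count POSTs per normalized URI and keep the five most frequent.
most_created = Counter(normalize_uri(uri) for uri in posted).most_common(5)
```

With the four paths from slide 35, the counts collapse into two groups, one for shouts and one for links.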
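
The storage figure on slide 40 can be reproduced with a few lines of arithmetic. This Python sketch assumes decimal units (1 TB = 1,000,000 MB) and takes the slide's 160 MB full scan as the stored size of 1,000,000 rows. Both are assumptions read off the slide rather than stated pricing rules, and `cost_usd` is a hypothetical helper.

```python
MB_PER_TB = 1_000_000  # assumption: decimal units

STORAGE_USD_PER_TB = 26.0  # per stored TB per month, from slide 40
QUERY_USD_PER_TB = 5.0     # per processed TB, from slide 40


def cost_usd(megabytes, usd_per_tb):
    # Convert a size in MB to TB, then apply the per-TB price.
    return megabytes / MB_PER_TB * usd_per_tb


storage_per_month = cost_usd(160, STORAGE_USD_PER_TB)  # 1,000,000 rows stored
full_scan = cost_usd(160, QUERY_USD_PER_TB)            # one full scan
one_column_scan = cost_usd(5.4, QUERY_USD_PER_TB)      # scan over one column

print(f"storage: ${storage_per_month:.5f} / month")
print(f"full scan: ${full_scan:.6f}")
print(f"one-column scan: ${one_column_scan:.6f}")
```

Under these assumptions the storage cost comes out at $0.00416 per month, matching the slide. The one-column scan is roughly 30 times cheaper than the full scan, since cost follows the bytes processed.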
