API analytics with Redis and Google BigQuery. NoSQL matters edition

At teowaki we have a system for API usage analytics that uses Redis as a fast intermediate store and BigQuery as a big data backend. As a result, we can run aggregated queries on our traffic/usage data in a few seconds and look for usage patterns that wouldn’t be obvious otherwise. In this session I will talk about the alternatives we evaluated and how we are using Redis and BigQuery to solve our problem.

  1. API Analytics with Redis and Google BigQuery. javier ramirez @supercoco9, https://teowaki.com, nosqlmatters 2013
  2. REST API + AngularJS web as an API client
  3. obvious solution: use a ready-made service such as 3scale or apigee
  4. (1) non intrusive metrics; (2) keep the history; (3) avoid vendor lock-in; (4) interactive queries; (5) cheap; (6) extra ball: real time
  5. (no text on this slide)
  6. "data that’s an order of magnitude greater than data you’re accustomed to". Doug Laney, VP Research, Business Analytics and Performance Management at Gartner
  7. "data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures." Edd Dumbill, program chair for the O’Reilly Strata Conference
  8. "bigdata is doing a full scan of 330MM rows, matching them against a regexp, and getting the result (223MM rows) in just 5 seconds". Javier Ramirez, impressionable teowaki founder
  9. (no text on this slide)
  10. Intel(R) Xeon(R) CPU E5520 @ 2.27GHz (with pipelining)
      $ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -P 16 -q
      SET: 552,028 requests per second
      GET: 707,463 requests per second
      LPUSH: 767,459 requests per second
      LPOP: 770,119 requests per second
      Intel(R) Xeon(R) CPU E5520 @ 2.27GHz (without pipelining)
      $ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -q
      SET: 122,556 requests per second
      GET: 123,601 requests per second
      LPUSH: 136,752 requests per second
      LPOP: 132,424 requests per second
  11. open source, BSD licensed, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets. http://redis.io, started in 2009 by Salvatore Sanfilippo (@antirez); 100 contributors at https://github.com/antirez/redis
  12. what is it used for
  13. twitter: every timeline (800 tweets per user) is stored in redis; 5000 writes per second avg; 300K reads per second
  14. nginx + lua + redis; apache + mruby + redis
  15. (no text on this slide)
  16. Redis keeps everything in memory all the time
  17. (no text on this slide)
  18. easy: store GZIPPED files into S3/Glacier* (* we are moving to google cloud now)
  19. (no text on this slide)
  20. Hadoop (map/reduce), http://hadoop.apache.org/, started in 2005 by Doug Cutting and Mike Cafarella
  21. cassandra, http://cassandra.apache.org/, released in 2008 by facebook
  22. other big data solutions: hadoop+voldemort+kafka (http://engineering.linkedin.com/projects); hbase (http://hbase.apache.org/)
  23. Amazon Redshift, http://aws.amazon.com/redshift/
  24. Our choice: google bigquery. Data analysis as a service. http://developers.google.com/bigquery
  25. Based on Dremel. Specifically designed for interactive queries over petabytes of real-time data
  26. Columnar storage: easy to compress; convenient for querying long series over a single column
  27. loading data: you can feed flat CSV files or nested JSON objects
  28. bq cli: bq load --nosynchronous_mode --encoding UTF-8 --field_delimiter 'tab' --max_bad_records 100 --source_format CSV api.stats 20131014T11-42-05Z.gz
  29. web console screenshot
  30. almost SQL: select, from, join, where, group by, having, order, limit; avg, count, max, min, sum; + * / %; & | ^ << >> ~; = != <> > < >= <=; IN, IS NULL, BETWEEN; AND, OR, NOT
  31. Functions overview: current_date, current_time, now, datediff, day, day_of_week, day_of_year, hour, minute, quarter, year...; abs, acos, atan, ceil, floor, degrees, log, log2, log10, PI, SQRT...; concat, contains, left, length, lower, upper, lpad, rpad, right, substr
  32. analytics specific extensions: within, flatten, nest, stddev, variance, top, first, last, nth, var_pop, var_samp, covar_pop, covar_samp, quantiles
  33. window functions
  34. correlations
  35. Things you always wanted to try but were too scared to: select count(*) from publicdata:samples.wikipedia where REGEXP_MATCH(title, "[0-9]*") AND wp_namespace = 0; Result: 223,163,387. Query complete (5.6s elapsed, 9.13 GB processed, Cost: 32¢)
  36. SELECT repository_name, repository_language, repository_description, COUNT(repository_name) as cnt, repository_url FROM github.timeline WHERE type="WatchEvent" AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC("#{yesterday} 20:00:00") AND repository_url IN ( SELECT repository_url FROM github.timeline WHERE type="CreateEvent" AND PARSE_UTC_USEC(repository_created_at) >= PARSE_UTC_USEC('#{yesterday} 20:00:00') AND repository_fork = "false" AND payload_ref_type = "repository" GROUP BY repository_url ) GROUP BY repository_name, repository_language, repository_description, repository_url HAVING cnt >= 5 ORDER BY cnt DESC LIMIT 25
  37. country segmented traffic
  38. our most active user
  39. 10 requests we should be caching
  40. 5 most created resources
  41. redis pricing: 2 machines* (master/slave) at digital ocean, $10 monthly (* we were already using these instances for a lot of redis use cases)
  42. s3 pricing: $0.095 per GB; a gzipped 1.6 MB file stores 300K rows; $0.0001541698 monthly
  43. glacier pricing: $0.01 per GB; $0.000016 monthly
  44. bigquery pricing: $80 per stored TB, 300000 rows => $0.007629392 / month; $35 per processed TB, 1 full scan = 84 MB, 1 count = 0 MB, 1 full scan over 1 column = 5.4 MB, 10 GB => $0.35 / month
  45. redis: $10.0000000000; s3 storage: $00.0001541698; s3 transfer: $00.0050000000; glacier transfer: $00.0500000000; glacier storage: $00.0000160000; bigquery storage: $00.0076293920; bigquery queries: $00.3500000000; total: $10.41 / month for our first 330000 rows
  46. (1) non intrusive metrics; (2) keep the history; (3) avoid vendor lock-in; (4) interactive queries; (5) cheap; (6) extra ball: real time
  47. Find related links at https://teowaki.com/teams/javier-community/link-categories/bigquery-talk. Thanks! Javier Ramírez @supercoco9, nosqlmatters 2013
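
Slides 18 and 28 describe archiving request logs as gzipped files and loading them into BigQuery as tab-delimited CSV. The following is a minimal Python sketch of how such a file might be produced. The row layout (timestamp, user id, HTTP method, path, status), the function name and the sample data are all assumptions made for illustration; the slides do not show the actual columns or the export code.

```python
import csv
import gzip
import io

# Hypothetical row layout; the slides do not list the actual columns.
FIELDS = ("timestamp", "user_id", "method", "path", "status")


def rows_to_gzipped_tsv(rows):
    """Serialize dict rows as tab-delimited text and gzip the result."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for row in rows:
        # Write the columns in a fixed order so every line has the same shape.
        writer.writerow([row[name] for name in FIELDS])
    return gzip.compress(buffer.getvalue().encode("utf-8"))


# Made-up sample requests, only to exercise the function.
sample_rows = [
    {"timestamp": "2013-10-14T11:42:05Z", "user_id": 1,
     "method": "GET", "path": "/api/links", "status": 200},
    {"timestamp": "2013-10-14T11:42:06Z", "user_id": 2,
     "method": "POST", "path": "/api/links", "status": 201},
]
payload = rows_to_gzipped_tsv(sample_rows)
```

The resulting bytes could be written to a `.gz` file and handed to a `bq load` invocation like the one quoted on slide 28.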

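
The monthly total quoted on slide 45 can be checked by adding up the per-service figures. The sketch below uses only the amounts shown on that slide; the dictionary keys are just labels, and pairing each label with each amount in slide order is an assumption (it agrees with the storage and query figures on slides 42 to 44).

```python
from decimal import Decimal

# Monthly costs in USD, copied from slide 45 (label/amount pairing assumed
# to follow the order in which they appear on the slide).
monthly_costs = {
    "redis": Decimal("10.0000000000"),
    "s3 storage": Decimal("0.0001541698"),
    "s3 transfer": Decimal("0.0050000000"),
    "glacier transfer": Decimal("0.0500000000"),
    "glacier storage": Decimal("0.0000160000"),
    "bigquery storage": Decimal("0.0076293920"),
    "bigquery queries": Decimal("0.3500000000"),
}

# Exact decimal sum, then rounded to cents for display.
total = sum(monthly_costs.values())
rounded_total = total.quantize(Decimal("0.01"))

print(f"${rounded_total} / month")
```

Almost all of the total is the fixed Redis hosting cost; the storage and query charges together come to well under 50 cents a month at this data volume.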