
API Analytics with Redis and Google BigQuery (LRUG)


At teowaki we have a system for API usage analytics, with Redis as a fast intermediate store and BigQuery as a big data backend. As a result, we can run aggregated queries over our traffic/usage data in just a few seconds, and we can look for usage patterns that wouldn't be obvious otherwise.

In this session I will talk about how we entered the Big Data world, which alternatives we evaluated, and how we are using Redis and Bigquery to solve our problem.


  1. API Analytics with Redis and Google BigQuery javier ramirez @supercoco9
  2. REST API + AngularJS web as an API client
  3. obvious solution: use a ready-made service such as 3scale or Apigee
  4. 1. non-intrusive metrics 2. keep the history 3. avoid vendor lock-in 4. interactive queries 5. cheap 6. extra ball: real time
  5. (image slide)
  6. "data that's an order of magnitude greater than data you're accustomed to" Doug Laney, VP Research, Business Analytics and Performance Management at Gartner
  7. "data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the structures of your database architectures." Ed Dumbill, program chair for the O'Reilly Strata Conference
  8. "big data is doing a full scan over 330MM rows, matching them against a regexp, and getting the result (223MM rows) in just 5 seconds" Javier Ramirez, impressionable teowaki founder
  9. (image slide)
  10. redis-benchmark on an Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
      with pipelining:
      $ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -P 16 -q
      SET: 552,028 requests per second
      GET: 707,463 requests per second
      LPUSH: 767,459 requests per second
      LPOP: 770,119 requests per second
      without pipelining:
      $ ./redis-benchmark -r 1000000 -n 2000000 -t get,set,lpush,lpop -q
      SET: 122,556 requests per second
      GET: 123,601 requests per second
      LPUSH: 136,752 requests per second
      LPOP: 132,424 requests per second
  11. an open source, BSD-licensed, advanced key-value store. It is often referred to as a data structure server, since keys can contain strings, hashes, lists, sets and sorted sets. Started in 2009 by Salvatore Sanfilippo @antirez; 100 contributors
  12. what is it used for?
  13. twitter: every timeline (800 tweets per user) is stored in Redis; 5,000 writes per second avg, 300K reads per second
  14. nginx + lua + redis / apache + mruby + redis
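The stack on slide 14 (web server + embedded scripting + Redis) captures metrics off the application's hot path. The same idea can be sketched as a Rack middleware in Ruby; the Redis key name (`api:stats`) and the captured fields are my assumptions, not the talk's actual schema:

```ruby
require "json"
require "time"

# Sketch: non-intrusive API metrics capture as Rack middleware.
# Each request is recorded as one JSON event on a Redis list; LPUSH is O(1),
# so logging adds negligible latency to the request itself.
class ApiMetrics
  def initialize(app, redis)
    @app   = app
    @redis = redis  # anything responding to lpush(key, value)
  end

  def call(env)
    started = Time.now
    status, headers, body = @app.call(env)
    event = {
      path:     env["PATH_INFO"],
      method:   env["REQUEST_METHOD"],
      status:   status,
      duration: Time.now - started,
      at:       started.utc.iso8601
    }
    @redis.lpush("api:stats", JSON.generate(event))
    [status, headers, body]
  end
end
```

A background job can then drain the list in batches, so the web tier never blocks on the analytics backend.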
  15. (image slide)
  16. Redis keeps everything in memory all the time
  17. (image slide)
  18. easy: store gzipped files in S3/Glacier (* we are moving to Google Cloud now)
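The archival step above can be sketched with Ruby's stdlib: drain buffered events (in the talk, popped from a Redis list) into a gzipped, tab-delimited file whose name follows the timestamp pattern visible on the `bq load` slide. The field order is an assumption:

```ruby
require "zlib"
require "time"

# Sketch: write a batch of events (arrays of string fields) as a gzipped
# TSV file, ready for cheap storage in S3/Glacier and later `bq load`.
# Filename format mirrors the one in the bq example: 20131014T11-42-05Z.gz
def archive_events(events, dir: ".")
  name = File.join(dir, Time.now.utc.strftime("%Y%m%dT%H-%M-%SZ") + ".gz")
  Zlib::GzipWriter.open(name) do |gz|
    events.each { |fields| gz.puts(fields.join("\t")) }
  end
  name
end
```

Gzipped TSV compresses very well for repetitive log data (the slides quote 300K rows in a 1.6 MB gzipped file), which is what keeps the S3/Glacier line items so small.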
  19. (image slide)
  20. Hadoop (map/reduce), started in 2005 by Doug Cutting and Mike Cafarella
  21. Cassandra, released in 2008 by Facebook
  22. other big data solutions: Hadoop + Voldemort + Kafka; HBase
  23. Amazon Redshift
  24. our choice: Google BigQuery, data analysis as a service
  25. based on Dremel; specifically designed for interactive queries over petabytes of real-time data
  26. columnar storage: easy to compress, convenient for querying long series over a single column
  27. loading data: you can feed it flat CSV files or nested JSON objects
  28. bq cli:
      bq load --nosynchronous_mode --encoding UTF-8 \
        --field_delimiter 'tab' --max_bad_records 100 \
        --source_format CSV api.stats 20131014T11-42-05Z.gz
  29. (screenshot of the BigQuery web console)
  30. almost SQL: select from join where group by having order limit
      aggregates: avg count max min sum
      operators: + * / % & | ^ << >> ~ = != <> > < >= <= IN IS NULL BETWEEN AND OR NOT
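A query in the style of the talk, using only the SQL subset listed above, could be built like this; the table (`api.stats`) and columns (`path`, `status`, `duration`) are hypothetical stand-ins for the real schema:

```ruby
# Hypothetical query: the slowest popular endpoints, the kind of question
# the deck answers with "10 requests we should be caching".
SLOW_PATHS = <<SQL
SELECT path, COUNT(*) AS hits, AVG(duration) AS avg_time
FROM api.stats
WHERE status >= 200 AND status < 300
GROUP BY path
HAVING hits > 100
ORDER BY avg_time DESC
LIMIT 10
SQL
```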
  31. functions overview:
      date/time: current_date current_time now datediff day day_of_week day_of_year hour minute quarter year...
      math: abs acos atan ceil floor degrees log log2 log10 PI SQRT...
      string: concat contains left length lower upper lpad rpad right substr
  32. analytics-specific extensions: within flatten nest stddev variance top first last nth var_pop var_samp covar_pop covar_samp quantiles
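Two of the extensions above, sketched as legacy-SQL query strings; table and column names are again hypothetical:

```ruby
# TOP(field, n) is a shorthand for the most frequent values of a column.
TOP_PATHS = <<SQL
SELECT TOP(path, 10), COUNT(*)
FROM api.stats
SQL

# QUANTILES(expr, n) returns n approximate quantile boundaries; with 101
# buckets you effectively get latency percentiles p0..p100.
LATENCY_PERCENTILES = <<SQL
SELECT QUANTILES(duration, 101)
FROM api.stats
SQL
```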
  33. window functions
  34. correlations
  35. things you always wanted to try but were too scared to:
      select count(*) from publicdata:samples.wikipedia
      where REGEXP_MATCH(title, "[0-9]*") AND wp_namespace = 0;
      223,163,387 rows. Query complete (5.6s elapsed, 9.13 GB processed, cost: 32¢)
  36. SELECT repository_name, repository_language, repository_description,
             COUNT(repository_name) AS cnt, repository_url
      FROM github.timeline
      WHERE type = "WatchEvent"
        AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC("#{yesterday} 20:00:00")
        AND repository_url IN (
          SELECT repository_url FROM github.timeline
          WHERE type = "CreateEvent"
            AND PARSE_UTC_USEC(repository_created_at) >= PARSE_UTC_USEC('#{yesterday} 20:00:00')
            AND repository_fork = "false"
            AND payload_ref_type = "repository"
          GROUP BY repository_url
        )
      GROUP BY repository_name, repository_language, repository_description, repository_url
      HAVING cnt >= 5
      ORDER BY cnt DESC
      LIMIT 25
  37. country-segmented traffic
  38. our most active user
  39. 10 requests we should be caching
  40. 5 most created resources
  41. redis pricing: 2 machines (master/slave) at Digital Ocean, $10 monthly (* we were already using these instances for a lot of Redis use cases)
  42. s3 pricing: $0.095 per GB; a gzipped 1.6 MB file stores 300K rows; $0.0001541698 / month
  43. glacier pricing: $0.01 per GB; $0.000016 / month
  44. bigquery pricing: $80 per stored TB; 300,000 rows => $0.007629392 / month; $35 per processed TB; 1 full scan = 84 MB, 1 count = 0 MB, 1 full scan over 1 column = 5.4 MB; 10 GB => $0.35 / month
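The arithmetic behind the BigQuery line items can be checked with the prices quoted on the slide ($80 per stored TB/month, $35 per processed TB; these are 2013 figures and long obsolete):

```ruby
# BigQuery cost model as quoted in the talk (2013 prices).
STORED_USD_PER_TB    = 80.0
PROCESSED_USD_PER_TB = 35.0

def bigquery_monthly_usd(stored_tb:, processed_tb:)
  stored_tb * STORED_USD_PER_TB + processed_tb * PROCESSED_USD_PER_TB
end

# 10 GB processed per month with negligible storage comes to about $0.35,
# matching the slide's "10 GB => $0.35 / month".
bigquery_monthly_usd(stored_tb: 0.0, processed_tb: 10.0 / 1000.0)
```

Note that because columnar storage only bills the columns a query touches, the processed volume (and so the cost) is usually far below a full-table scan.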
  45. redis             $10.0000000000
      s3 storage        $00.0001541698
      s3 transfer       $00.0050000000
      glacier transfer  $00.0500000000
      glacier storage   $00.0000160000
      bigquery storage  $00.0076293920
      bigquery queries  $00.3500000000
      total: $10.41 / month for our first 330000 rows
  46. 1. non-intrusive metrics 2. keep the history 3. avoid vendor lock-in 4. interactive queries 5. cheap 6. extra ball: real time
  47. Find related links at Cheers! Javier Ramírez @supercoco9